Many real-world reinforcement learning tasks require multiple agents to make sequential decisions under the agents' interaction, where well-coordinated actions among the agents are crucial for achieving the target goal in these tasks. One way to accelerate the coordination effect is to enable multiple agents to communicate with each other in a distributed manner and behave as a group. In this paper, we study a practical scenario in which (i) the communication bandwidth is limited and (ii) the agents share the communication medium, so that only a restricted number of agents are able to use the medium simultaneously, as in the state-of-the-art wireless networking standards. This calls for a certain form of communication scheduling. In that regard, we propose a multi-agent deep reinforcement learning framework, called SchedNet, in which agents learn how to schedule themselves, how to encode their messages, and how to select actions based on received messages. SchedNet is capable of deciding which agents should be entitled to broadcast their (encoded) messages, by learning the importance of each agent's partially observed information. We evaluate SchedNet against multiple baselines under two different applications, namely, cooperative communication and navigation, and predator-prey. Our experiments show a non-negligible performance gap, ranging from 32% to 43%, between SchedNet and other mechanisms such as those without communication or with vanilla scheduling methods, e.g., round robin.
Reinforcement Learning (RL) has garnered renewed interest in recent years. Playing Atari games (Mnih et al., 2015), robotics control (Gu et al., 2017; Lillicrap et al., 2015), and adaptive video streaming (Mao et al., 2017) constitute just a few of the vast range of RL applications. Combined with developments in deep learning, deep reinforcement learning (Deep RL) has emerged as an accelerator in related fields. From the well-known success in single-agent deep reinforcement learning, such as Mnih et al. (2015), we now witness growing interest in its multi-agent extension, multi-agent reinforcement learning (MARL), exemplified in Gupta et al. (2017); Lowe et al. (2017); Foerster et al. (2017a); Omidshafiei et al. (2017); Foerster et al. (2016); Sukhbaatar et al. (2016); Mordatch & Abbeel (2017); Havrylov & Titov (2017); Palmer et al. (2017); Peng et al. (2017); Foerster et al. (2017c); Tampuu et al. (2017); Leibo et al. (2017); Foerster et al. (2017b). In the MARL problem commonly addressed in these works, multiple agents interact in a single environment repeatedly and improve their policy iteratively by learning from observations to achieve a common goal. Of particular interest is the distinction between two lines of research: one fostering direct communication among the agents themselves, as in Foerster et al. (2016); Sukhbaatar et al. (2016), and the other coordinating their cooperative behavior without direct communication, as in Foerster et al. (2017b); Palmer et al. (2017); Leibo et al. (2017).
In this work, we concern ourselves with the former. We consider MARL scenarios wherein the task at hand is of a cooperative nature and agents are situated in a partially observable environment, but each endowed with different observation power. We formulate this scenario into a multi-agent
sequential decision-making problem, such that all agents share the goal of maximizing the same discounted sum of rewards. For the agents to communicate directly with each other and behave as a coordinated group rather than merely coexisting individuals, they must carefully determine the information they exchange under a practical bandwidth-limited environment and/or in the case of high communication cost. To coordinate this exchange of messages, we adopt the centralized training and distributed execution paradigm popularized in recent works, e.g., Foerster et al. (2017a); Lowe et al. (2017); Sunehag et al. (2018); Rashid et al. (2018); Gupta et al. (2017).
In addition to bandwidth-related constraints, we take the issue of sharing the communication medium into consideration, especially when agents communicate over wireless channels. The state-of-the-art standards on wireless communication such as Wi-Fi and LTE specify the way of scheduling users as one of their basic functions. However, as elaborated in Related work, MARL problems involving scheduling of only a restricted set of agents have not yet been extensively studied. The key challenges in this problem are: (i) that limited bandwidth implies that agents must exchange succinct information: something concise and yet meaningful, and (ii) that the shared medium means that potential contenders must be appropriately arbitrated for proper collision avoidance, necessitating a certain form of communication scheduling, popularly referred to as MAC (Medium Access Control) in the area of wireless communication. While stressing the coupled nature of the encoding/decoding and the scheduling issue, we zero in on the said communication channel-based concerns and construct our neural network accordingly.
Contributions In this paper, we propose a new deep multi-agent reinforcement learning architecture, called SchedNet, with the rationale of centralized training and distributed execution in order to better achieve a common goal via decentralized cooperation. During distributed execution, agents are allowed to communicate over wireless channels, where messages are broadcast to all agents in each agent's communication range. This broadcasting feature of wireless communication necessitates a Medium Access Control (MAC) protocol to arbitrate contending communicators in a shared medium; CSMA (Carrier Sense Multiple Access) in Wi-Fi is one such MAC protocol. While prior work on MARL to date considers only the limited bandwidth constraint, we additionally address the shared medium contention issue in what we believe is the first work of its kind: deciding which nodes are granted access to the shared medium. Intuitively, nodes with more important observations should be chosen, for which we adopt a simple yet powerful mechanism called the weight-based scheduler (WSA), designed to reconcile simplicity in training with the integrity of reflecting real-world MAC protocols in use (e.g., 802.11 Wi-Fi). We evaluate SchedNet for two applications, cooperative communication and navigation and predator/prey, and demonstrate that SchedNet outperforms other baseline mechanisms such as those without any communication or with a simple scheduling mechanism such as round robin. We comment that SchedNet is not intended to compete with other algorithms for cooperative multi-agent tasks that do not consider scheduling, but to complement them. We believe that adding our idea of agent scheduling makes those algorithms much more practical and valuable.
Related work We now discuss the body of relevant literature. Busoniu et al. (2008) and Tan (1993) have studied MARL with decentralized execution extensively. However, these are based on tabular methods, so they are restricted to simple environments. Combined with developments in deep learning, deep MARL algorithms have emerged (Tampuu et al., 2017; Foerster et al., 2017a; Lowe et al., 2017). Tampuu et al. (2017) use a combination of DQN with independent Q-learning. This independent learning does not perform well because each agent considers the others as part of the environment and ignores them. Foerster et al. (2017a); Lowe et al. (2017); Gupta et al. (2017); Sunehag et al. (2018), and Foerster et al. (2017b) adopt the framework of centralized training with decentralized execution, empowering the agents to learn cooperative behavior that accounts for other agents' policies without any communication during distributed execution.
It is widely accepted that communication can further enhance the collective intelligence of learning agents in their attempt to complete cooperative tasks. To this end, a number of papers have previously studied the learning of communication protocols and languages to use among multiple agents in reinforcement learning. We explore those bearing the closest resemblance to our research. Foerster et al. (2016); Sukhbaatar et al. (2016); Peng et al. (2017); Guestrin et al. (2002), and Zhang & Lesser (2013) train multiple agents to learn a communication protocol, and have shown that communicating agents achieve better rewards at various tasks. Mordatch & Abbeel (2017) and Havrylov & Titov (2017) investigate the possibility of the artificial emergence of language. Coordinated RL by Guestrin et al. (2002) is an earlier work demonstrating the feasibility of structured communication and the agents' selection of a jointly optimal action.
Only DIAL (Foerster et al., 2016) and Zhang & Lesser (2013) explicitly address bandwidth-related concerns. In DIAL, the communication channel of the training environment has a limited bandwidth, such that the agents being trained are urged to establish more resource-efficient communication protocols. The environment in Zhang & Lesser (2013) also has a limited-bandwidth channel in effect, due to the large amount of information exchanged in running a distributed constraint optimization algorithm. Recently, Jiang & Lu (2018) propose an attentional communication model that allows some agents, who request additional information from others, to gather observations from neighboring agents. However, they do not explicitly consider the constraints imposed by limited communication bandwidth and/or the scheduling required by communication over a shared medium.
To the best of our knowledge, there is no prior work that incorporates an intelligent scheduling entity in order to facilitate inter-agent communication in both limited-bandwidth and shared-medium-access scenarios. As outlined in the introduction, intelligent scheduling among learning agents is pivotal in the orchestration of their communication to better utilize the limited available bandwidth, as well as in the arbitration of agents contending for shared medium access.
that selects i's action based only on what is partially observed by i. The critic is naturally responsible for centralized training and thus works in a centralized manner; it is therefore allowed to take the global state s as its input, which includes all agents' observations and extra information from the environment. The role of the critic is to "criticize" each individual agent's actions. This centralized nature of the critic helps in providing more accurate feedback to the individual actors with limited observation horizons. In this case, each agent's policy, π_i, is updated by a variant of (1) as:
In practical scenarios where agents are typically separated but are able to communicate over a shared medium, e.g., a frequency channel in wireless communications, two important constraints are imposed: bandwidth and contention for medium access (Rappaport, 2001). The bandwidth constraint entails a limited number of bits per unit time, and the contention constraint requires avoiding collisions among multiple simultaneous transmissions, which arise from the broadcast nature of signals in wireless communication. Thus, only a restricted number of agents are allowed to transmit their messages at each time step for reliable message transfer. In this paper, we use a simple model in which the aggregate information size per time step is limited to L_band bits and only K_sched out of n agents may broadcast their messages.
Weight-based Scheduling Noting that distributed execution of agents is of significant importance, there may exist a variety of scheduling mechanisms for scheduling K_sched agents in a distributed manner. In this paper, we adopt a simple weight-based algorithm, which we call WSA (Weight-based Scheduling Algorithm). Once each agent decides its own weight, the agents are scheduled based on their weights following one of a class of pre-defined rules. We consider the following two specific rules among many possible proposals, chosen for their simplicity and, more importantly, because they approximate wireless scheduling protocols used in practice well.
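As a concrete illustration of the two rules, a minimal numpy sketch follows; the function names, tie-breaking behavior, and the use of a softmax over the raw weights in Softmax(k) are our own assumptions rather than an exact specification.

```python
import numpy as np

def top_k(weights, k):
    """Deterministic rule: schedule the k agents with the largest weights."""
    order = np.argsort(weights)[::-1]          # indices sorted by decreasing weight
    schedule = np.zeros(len(weights), dtype=int)
    schedule[order[:k]] = 1                    # schedule profile c: 1 = may broadcast
    return schedule

def softmax_k(weights, k, rng=np.random.default_rng()):
    """Probabilistic rule: sample k distinct agents with probability
    proportional to exp(w_i) (a softmax over the weights; assumed form)."""
    probs = np.exp(weights - np.max(weights))
    probs /= probs.sum()
    chosen = rng.choice(len(weights), size=k, replace=False, p=probs)
    schedule = np.zeros(len(weights), dtype=int)
    schedule[chosen] = 1
    return schedule

# Example: four agents, one slot (k = 1), weights as reported for the PP experiment.
w = np.array([0.74, 0.27, 0.26, 0.26])
print(top_k(w, 1))        # [1 0 0 0] -- the highest-weight agent wins
print(softmax_k(w, 1))    # random, biased towards agent 0
```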
Since distributed execution is one of our major operational constraints in SchedNet or other CTDE-based MARL algorithms, Top(k) and Softmax(k) should be realizable via a weight-based mechanism in a distributed manner. In fact, this has been an active research topic to date in wireless networking, where many algorithms exist (Tassiulas & Ephremides, 1992; Yi et al., 2008; Jiang & Walrand, 2010). Due to space limitation, we present how to obtain distributed versions of those two rules based on weights in our supplementary material. To summarize, using so-called CSMA (Carrier Sense Multiple Access) (Kurose, 2005), which is a fully distributed MAC scheduler and forms a basis of Wi-Fi, given agents' weight values, it is possible to implement Top(k) and Softmax(k).
Our goal is to train agents such that, every time the agents take actions, only K_sched agents can broadcast their messages of limited size L_band, with the aim of receiving the highest cumulative reward via cooperation. Each agent should determine a policy described by its scheduling weights, encoded communication messages, and actions.
To this end, we propose a new deep MARL framework with scheduled communications, called SchedNet, whose overall architecture is depicted in Figure 1. SchedNet consists of the following three components: (i) actor network, (ii) scheduler, and (iii) critic network. This section is devoted to presenting the architecture only, whose details are presented in the subsequent sections.
Neural networks The actor network is the collection of n per-agent individual actor networks, where each agent i's individual actor network consists of a triple of the following networks: a message encoder, a weight generator, and an action selector.
Coupling: Actor and Scheduler The encoder, the weight generator, and the scheduler are the modules that handle the constraints of limited bandwidth and shared medium access. Their common goal is to learn the state-dependent "importance" of each agent's observation: the encoders generate compressed messages, and the per-agent weight generators produce the weights that serve as the basis of the external scheduling mechanism. These three modules work together to smartly respond to time-varying states. The action selector is trained to decode the incoming messages and, consequently, to take good actions that maximize the reward. At every time step, the schedule profile c varies depending on the observation of each agent, so the incoming message m comes from a different combination of agents. Since the agents can be heterogeneous and each has its own encoder, the action selector must be able to make sense of incoming messages from different senders. Moreover, as the weight generator's policy changes, the distribution of incoming messages also changes, which is in turn shaped by the pre-defined WSA. Thus, the action selector should adjust to the changed scheduling, which in turn affects the encoder. The updates of the encoder and the action selector trigger the update of the scheduler again. Hence, weight generators, message encoders, and action selectors are strongly coupled through their dependence on a specific WSA, and we train these three networks at the same time with a common critic.
Scheduling logic The schedule profile c is determined by the WSA module, which is mathematically a mapping from all agents' weights w (generated by the per-agent weight generators) to c. Typical examples of these mappings are Top(k) and Softmax(k), as mentioned above. The scheduler of each agent is trained appropriately depending on the employed WSA algorithm.
parametrized by θ_c to estimate the state value function V_θc(s) for the action selectors and message encoders, and the action-value function Q^π(s, w) for the weight generators. The critic is used only during training, and it can use the global state s, which includes the observations of all agents. All networks in the actor are trained with gradients based on temporal difference backups. To share common features between V_θc(s) and Q^π(s, w) and perform efficient training, we use shared parameters in the lower layers of the neural network between the two functions, as shown in Figure 2.
where s and s′ are the global states corresponding to the observations at the current and next time steps. We can get the value of the state V_θc(s) from the centralized critic and then adjust the parameters θ_u via gradient ascent accordingly.
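To make the parameter sharing concrete, here is a minimal PyTorch-style sketch of a critic with two heads, V_θc(s) and Q^π(s, w), whose first two hidden layers are shared (cf. Figure 2); the layer widths and the point at which the weight vector w enters the Q head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SharedCritic(nn.Module):
    """Centralized critic with two heads on shared lower layers:
    V(s) criticizes the action selectors and encoders,
    Q(s, w) criticizes the weight generators."""
    def __init__(self, state_dim, n_agents, hidden=64):
        super().__init__()
        # First two hidden layers shared between the two value functions.
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Each function gets its own third hidden layer and output.
        self.v_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))
        # Where w enters is an assumption (here: concatenated at the Q head).
        self.q_head = nn.Sequential(nn.Linear(hidden + n_agents, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

    def forward(self, state, weights):
        h = self.shared(state)
        v = self.v_head(h)                                    # V(s)
        q = self.q_head(torch.cat([h, weights], dim=-1))      # Q(s, w)
        return v, q

# Toy usage: a global state of dimension 20 and four agents' scheduling weights.
critic = SharedCritic(state_dim=20, n_agents=4)
v, q = critic(torch.randn(1, 20), torch.rand(1, 4))
```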
In execution, each agent i should be able to determine its scheduling weight w_i, encoded message m_i, and action u_i in a distributed manner. This process must be based on its own observation, using its own action selector, message encoder, and weight generator, with the parameters θ^as_i, θ^enc_i, and θ^wg_i, respectively. After each agent determines its scheduling weight, K_sched agents are scheduled by the WSA, and the encoded messages of the scheduled agents are broadcast to all agents. Finally, each agent selects an action using the received messages. This process is repeated sequentially under different observations over time.
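The per-step execution flow described above can be summarized by the following toy sketch; the three per-agent modules are replaced by hypothetical stand-in functions, and only the data flow (weights, WSA schedule, messages of the scheduled agents, actions) mirrors the description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the trained per-agent networks.
def weight_generator(obs):            # w_i = f^wg_i(o_i)
    return float(obs.mean())
def message_encoder(obs, l=2):        # m_i = f^enc_i(o_i), an l-dimensional message
    return obs[:l]
def action_selector(obs, messages):   # u_i = f^as_i(o_i, received messages)
    return int(np.argmax(np.concatenate([obs] + messages)))

def execute_step(observations, k=1):
    # 1) Every agent computes its scheduling weight from its own observation.
    weights = np.array([weight_generator(o) for o in observations])
    # 2) The WSA maps the weights to a schedule profile c (here: Top(k)).
    scheduled = np.argsort(weights)[::-1][:k]
    # 3) Only the scheduled agents encode and broadcast their messages.
    messages = [message_encoder(observations[i]) for i in scheduled]
    # 4) Every agent selects its action from its observation and the received messages.
    return [action_selector(o, messages) for o in observations]

obs = [rng.random(5) for _ in range(4)]   # four agents with toy observations
print(execute_step(obs, k=1))
```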
Environments To evaluate SchedNet, we consider two different environments for demonstrative purposes: Predator and Prey (PP), which is used in Stone & Veloso (2000), and Cooperative Communication and Navigation (CCN), which is a simplified version of the one in Lowe et al. (2017). The detailed experimental environments are elaborated in the following subsections as well as in the supplementary material. We take the communication environment into consideration as follows: k out of all agents have the chance to broadcast a message whose bandwidth is limited by l.
Tested algorithms and setup We perform experiments in the aforementioned environments. We compare SchedNet with a variant of DIAL (Foerster et al., 2016) that allows communication with limited bandwidth. During the execution of DIAL, a limited number (k) of agents are scheduled following a simple round-robin scheduling algorithm, and each agent reuses the outdated messages of non-scheduled agents to decide which action to take; we call this variant DIAL(k). The other baselines are independent DQN (IDQN) (Tampuu et al., 2017) and COMA (Foerster et al., 2017a), in which no agent is allowed to communicate.
Figure 3: Learning curves for the PP and CCN tasks. The plots show the average time taken to complete the task, where shorter time is better for the agents.
To see the impact of scheduling in SchedNet, we compare SchedNet with (i) RR (round robin), which is a canonical scheduling method in communication systems where all agents are sequentially scheduled, and (ii) FC (full communication), which is the ideal configuration wherein all the agents can send their messages without any scheduling or bandwidth constraints. We also diversify the WSA in SchedNet into (i) Sched-Softmax(1) and (ii) Sched-Top(1), whose details are in Section 3.1. We train our models until convergence, and then evaluate them by averaging metrics over 1,000 iterations. The shaded area in each plot denotes 95% confidence intervals based on 6-10 runs with different seeds.
In this task, there are multiple agents who must capture a randomly moving prey. Each agent's observation includes its own position and the relative position of the prey, if observed. We employ four agents with different observation horizons, where only agent 1 has a 5×5 view while agents 2, 3, and 4 have a smaller, 3×3 view. The predators are rewarded when they capture the prey, and thus the performance metric is the number of time steps taken to capture the prey.
Result in PP Figure 3a illustrates the learning curve over 750,000 steps in PP. In FC, since the agents can use full state information even during execution, they achieve the best performance. SchedNet outperforms IDQN and COMA, in which communication is not allowed. We observe that agents first find the prey and then follow it until all other agents also eventually observe the prey. An agent successfully learns to follow the prey once it has observed it, but it takes a long time to encounter the prey for the first time. If an agent broadcasts a message that includes the location information of the prey, then the other agents can find the prey more quickly. Thus, it is natural that SchedNet and DIAL perform better than IDQN or COMA, because they are trained to exploit communication. However, DIAL is not trained to work under medium contention constraints. Although DIAL works well when there are no contention constraints, when only one agent is scheduled to broadcast its message by a simple scheduling algorithm (i.e., RR), the average number of steps to capture the prey in DIAL(1) is larger than that of SchedNet-Top(1), because the outdated messages of non-scheduled agents are noisy inputs for the agents deciding on actions. Thus, scheduling should be taken into account from the training stage so that the agents can work in such a demanding environment.
Impact of intelligent scheduling In Figure 3b, we observe that IDQN, RR, and SchedNet-Softmax(1) lie more or less on a comparable performance tier, with SchedNet-Softmax(1) as the best in the tier. SchedNet-Top(1) demonstrates a non-negligible improvement over this tier, implying that a deterministic selection improves the agents' collective rewards the most. In particular, SchedNet-Top(1) improves the performance by 43% compared to RR. Figure 3b lets us infer that, since all the agents are trained under the same conditions except for the scheduler, the difference in the scheduler is the sole determining factor for the variation in performance. Thus, ablating away the benefit from smart encoding, the intelligent scheduling element in SchedNet can be credited with the better performance.
Weight-based Scheduling We attempt to explain the internal behavior of SchedNet by investigating instances of temporal scheduling profiles obtained during execution. We observe that SchedNet has learned to schedule the agents with farther observation horizons, realizing the rationale of importance-based assignment of scheduling priority in the PP scenario as well. Recall that Agent 1 has a wider view and thus tends to obtain valuable observations more frequently. In Figure 4, we see that scheduling chances are distributed as (14, 3, 4, 4), where the corresponding average weights are (0.74, 0.27, 0.26, 0.26), implying that agents with greater observation power tend to be scheduled more often.
Message encoding We now attempt to understand what the predator agents communicate when performing the task. Figure 5 shows 2D projections of the messages generated by the scheduled agent under SchedNet-Top(1) with l = 2. When the agent does not observe the prey (blue circles in the figure), most of the messages reside in the bottom or the left partition of the plot. On the other hand, the messages have large variance when the agent observes the prey (red 'x'). This is because the agent should transfer more informative messages, implicitly including the location of the prey, when it observes the prey. Further analysis of the messages is presented in our supplementary material.
In this task, each agent's goal is to arrive at a pre-specified destination in its one-dimensional world, and the agents collect a joint reward when both reach their respective destinations. Each agent has a zero observation horizon around itself, but it can observe the situation of the other agent. We introduce heterogeneity into the scenario, where the agent-destination distance at the beginning of the task differs across agents. The metric used to gauge performance is the number of time steps taken to complete the CCN task.
Result in CCN We examine the CCN environment, whose results are shown in Figure 3c. SchedNet and the other baselines were trained for 200,000 steps. As expected, IDQN takes the longest time, and FC takes the shortest. RR exhibits mediocre performance, better than IDQN, because agents at least take turns obtaining the communication opportunity. Of particular interest is SchedNet, outperforming both IDQN and RR with a non-negligible gap. We remark that the deterministic selection with SchedNet-Top(1) slightly beats its probabilistic counterpart, SchedNet-Softmax(1). The 32% improvement of SchedNet over RR clearly portrays the effect of intelligent scheduling, as the carefully learned scheduling method of SchedNet completes the CCN task faster than the simplistic RR.
Scheduling in CCN As Agent 2 is farther from its destination than Agent 1, we observe that Agent 1 is scheduled more frequently to drive Agent 2 to its destination (7 vs. 18), as shown in Figure 6. This evidences that SchedNet flexibly adapts to the heterogeneity of agents via scheduling. Towards more efficient completion of the task, more scheduling opportunities should be given to more important agents. This is in accordance with the results obtained in the PP environment: more important agents are scheduled more often.
We have proposed SchedNet for learning to schedule inter-agent communications in fully-cooperative multi-agent tasks. In SchedNet, we have the centralized critic giving feedback to the actor, which consists of the message encoders, action selectors, and weight generators of each individual agent. The message encoders and action selectors are criticized towards compressing observations more efficiently and selecting actions that are more rewarding in view of the cooperative task at hand. Meanwhile, the weight generators are criticized such that the k agents with apparently more valuable observations are allowed to access the shared medium and broadcast their messages to all other agents. Empirical results and an accompanying ablation study indicate that the learnt encoding and scheduling behaviors each significantly improve the agents' performance. We have observed that intelligent, distributed communication scheduling can aid in a more efficient, coordinated, and rewarding behavior of learning agents in the MARL setting.
This work was supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00170, Virtual Presence in Moving Objects through 5G).
REFERENCES
Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.
Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, And Cybernetics, 2008.
Adam Dunkels, Bjorn Gronvall, and Thiemo Voigt. Contiki-a lightweight and flexible operating system for tiny networked sensors. In Proceedings of Local Computer Networks (LCN), 2004.
Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Proceedings of Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.
Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017a.
Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Philip Torr, Pushmeet Kohli, Shimon Whiteson, et al. Stabilising experience replay for deep multi-agent reinforcement learning. arXiv preprint arXiv:1702.08887, 2017b.
Jakob N. Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017c.
Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017.
Carlos Guestrin, Michail Lagoudakis, and R Parr. Coordinated reinforcement learning. In Proceedings of International Conference on Machine Learning, 2002.
Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, 2017.
Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: learning to communicate with sequences of symbols. In Proceedings of Advances in Neural Information Processing Systems, pp. 2146–2156, 2017.
Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length and Helmholtz free energy. In Proceedings of Advances in Neural Information Processing Systems, pp. 3–10, 1994.
Hyeryung Jang, Se-Young Yun, Jinwoo Shin, and Yung Yi. Distributed learning for utility maximization over CSMA-based wireless multihop networks. In Proceedings of IEEE INFOCOM, 2014.
Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. arXiv preprint arXiv:1805.07733, 2018.
Libin Jiang and Jean Walrand. A distributed csma algorithm for throughput and utility maximization in wireless networks. IEEE/ACM Transactions on Networking (ToN), 18(3):960–972, 2010.
Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
James F Kurose. Computer networking: A top-down approach featuring the internet. Pearson Education India, 2005.
Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of International Conference on Autonomous Agents and MultiAgent Systems, pp. 464–473, 2017.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor- critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017.
Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. Neural adaptive video streaming with pensieve. In Proceedings of ACM Sigcomm, pp. 197–210, 2017.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of International Conference on Machine Learning, pp. 1928–1937, 2016.
Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.10069, 2017.
Frans A Oliehoek, Christopher Amato, et al. A concise introduction to decentralized POMDPs, volume 1. Springer, 2016.
Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian. Deep decentralized multi-task multi-agent RL under partial observability. arXiv preprint arXiv:1703.06182, 2017.
Gregory Palmer, Karl Tuyls, Daan Bloembergen, and Rahul Savani. Lenient multi-agent deep reinforcement learning. arXiv preprint arXiv:1707.04402, 2017.
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.
Theodore Rappaport. Wireless Communications: Principles and Practice. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 2001. ISBN 0130422320.
Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of International Conference on Machine Learning, 2018.
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of International Conference on Machine Learning, 2014.
Peter Stone and Manuela Veloso. Multiagent systems: A survey from a machine learning perspec- tive. Autonomous Robots, 8(3):345–383, 2000.
Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Proceedings of Advances in Neural Information Processing Systems, pp. 2244–2252, 2016.
Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinícius Flores Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of AAMAS, 2018.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.
Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of Advances in Neural Information Processing Systems, 2000.
Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12(4):e0172395, 2017.
Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of International Conference on Machine Learning, pp. 330–337, 1993.
Leandros Tassiulas and Anthony Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE transactions on automatic control, 37(12):1936–1948, 1992.
Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
Yung Yi, Alexandre Proutière, and Mung Chiang. Complexity in wireless scheduling: Impact and tradeoffs. In Proceedings of ACM Mobihoc, 2008.
Chongjie Zhang and Victor Lesser. Coordinating multi-agent reinforcement learning with limited communication. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, 2013.
The training algorithm for SchedNet is provided in Algorithm 1. The parameters of the message
Predator and prey We assess SchedNet in the predator-prey setting of Stone & Veloso (2000), illustrated in Figure 7a. This setting involves a discretized grid world and multiple cooperating predators who must capture a randomly moving prey. Each agent's observation includes its own position and the relative position of the prey, if observed. The observation horizon of each predator is limited, thereby emphasizing the need for communication. The termination criterion for the task is that all agents observe the prey, as shown on the right of Figure 7a. The predators are rewarded when the task is terminated. We note that agents may be endowed with different observation horizons, making them heterogeneous. We employ four agents in our experiment, where only agent 1 has a 5×5 view while agents 2, 3, and 4 have a smaller, 3×3 view. The performance metric is the number of time steps taken to capture the prey.
Cooperative communication and navigation We adopt and modify the cooperative communication and navigation task in Lowe et al. (2017), where we test SchedNet in a simple one-dimensional grid as in Figure 7b. In CCN, each of the two agents resides in its own one-dimensional grid world. Each agent's goal is to arrive at a pre-specified destination (denoted by the square with a star or a heart for Agents 1 and 2, respectively), and they collect a joint reward when both agents reach their target destinations. Each agent has a zero observation horizon around itself, but it can observe the situation of the other agent. We introduce heterogeneity into the scenario, where the agent-destination distance at the beginning of the task differs across agents. In our example, Agent 2 is initially located at a farther place from its destination, as illustrated in Figure 7b. The metric used to gauge the performance of SchedNet is the number of time steps taken to complete the CCN task.
Table 1 shows the values of the hyperparameters for the CCN and PP tasks. We use the Adam optimizer to update the network parameters and soft target updates to update the target networks. The structure of the networks is the same across tasks. For the critic, we use three hidden layers, where the critic heads for the scheduler and the action selector share the first two layers. For the action selector, we use one hidden layer; for the encoder and the weight generator, three hidden layers each. The networks use rectified linear units for all hidden layers. Because the complexity of the two tasks differs, we size the hidden layers differently. The actor network and the critic network for CCN have hidden layers with 8 units and 16 units, respectively. The actor network and the critic network for PP have hidden layers with 32 units and 64 units, respectively.
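For reference, the layer structure described above can be sketched roughly as follows; the observation, message, and action dimensions are hypothetical placeholders, and reading the "actor" hidden width as applying to all three per-agent modules is our assumption.

```python
import torch.nn as nn

def mlp(sizes):
    """Stack of Linear layers with ReLU on the hidden layers only."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Hidden widths from the text: (actor, critic) = (8, 16) for CCN, (32, 64) for PP.
H_ACTOR, H_CRITIC = 32, 64                  # PP task
OBS, MSG, ACT, N_AGENTS = 10, 2, 5, 4       # hypothetical dimensions

encoder         = mlp([OBS, H_ACTOR, H_ACTOR, H_ACTOR, MSG])   # three hidden layers
weight_gen      = mlp([OBS, H_ACTOR, H_ACTOR, H_ACTOR, 1])     # three hidden layers
action_selector = mlp([OBS + N_AGENTS * MSG, H_ACTOR, ACT])    # one hidden layer
```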
Figure 8: Performance evaluation of SchedNet. The graphs show the average time taken to complete the task, where shorter time is better for the agents.
C.1 PREDATOR AND PREY
Impact of bandwidth (l) and number of schedulable agents (k) Due to communication constraints, only k agents can communicate: the scheduled agents broadcast their messages, each of which has a limited size l due to the bandwidth constraint. We examine the impact of l and k on performance in Figure 8a. As l increases, more information can be encoded into the message, which can be used by the other agents to take actions. Since the encoder and the actor are trained to maximize the shared goal of all agents, they achieve higher performance with increasing l. In Figure 8b, we compare the cases where k = 1, 2, 3, and FC, in which all agents can access the medium, with l = 1. As we can expect, the general tendency is that performance grows as k increases.
Impact of joint scheduling and encoding To study the effect of jointly coupling scheduling and encoding, we devise a comparison against a pre-trained auto-encoder (Bourlard & Kamp, 1988; Hinton & Zemel, 1994). An auto-encoder was trained ahead of time, and the encoder part of this auto-encoder was placed in the actor's ENC module in Figure 1. The encoder part is not trained further while training the other parts of the network. Henceforth, we name this modified actor "AE". Figure 8c shows the learning curve of AE and the other baselines. Table 2 highlights the impact of joint scheduling and encoding. The numbers shown are the performance metric normalized to the FC case in the PP environment. While SchedNet-Top(1) took only 2.030 times as long as FC to finish the PP task, the AE-equipped actor took 3.408 times as long as FC. This lets us ascertain that utilizing a pre-trained auto-encoder deprives the agent of the benefit of jointly training the scheduler and encoder networks in SchedNet.
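A minimal sketch of this ablation setup, assuming a standard autoencoder trained on observations with a reconstruction loss, whose encoder is then frozen and reused as the ENC module; the architecture, dimensions, and training data below are illustrative placeholders.

```python
import torch
import torch.nn as nn

OBS_DIM, MSG_DIM = 10, 2   # hypothetical observation and message sizes

# 1) Pre-train a plain autoencoder on observations (reconstruction loss).
encoder = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.ReLU(), nn.Linear(32, MSG_DIM))
decoder = nn.Sequential(nn.Linear(MSG_DIM, 32), nn.ReLU(), nn.Linear(32, OBS_DIM))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(1000):
    obs = torch.randn(64, OBS_DIM)          # placeholder batch; in practice, collected observations
    loss = ((decoder(encoder(obs)) - obs) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# 2) Freeze the encoder and plug it in as the actor's ENC module ("AE" baseline).
for p in encoder.parameters():
    p.requires_grad_(False)                 # not trained further with the rest of the actor
```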
What messages agents broadcast In Section 4.1, we attempted to understand what the predator agents communicate when performing the PP task with k = 1 and l = 2. In this section, we look into the messages in detail. Figure 9 shows the projections of the messages generated by the scheduled agent based on its own observation. In the PP task, the most important information is the location of the prey, which can be estimated from the observations of other agents. Thus, we are interested in the location information of the prey and the other agents. We classify the messages into four classes based on the quadrant in which the prey and the predator are located, and mark each class with a different color. Figure 9a shows the messages for different relative locations of the prey in the agents' observations, and Figure 9b shows the messages for different locations of the agent who sends the message. We can observe a general trend in the messages according to the class. We thus conclude that if the agents observe the prey, they encode into the message the relevant information that is helpful for estimating the location of the prey. The agents who receive this message interpret it to select their actions.
In MARL, the partial observability issue is one of the major problems, and there are two typical ways to tackle it. First, using an RNN structure to indirectly remember the history can alleviate partial observability. Another way is to use the observations of other agents through communication among them. In this paper, we focus more on the latter, because the goal of this paper is to show the importance of learning to schedule in a practical communication environment in which shared medium contention is inevitable.
Enlarging the observation through communication is somewhat orthogonal to considering temporal correlation. Thus, we can easily merge SchedNet with an RNN, which is appropriate for some partially observable environments.
We add one GRU layer into each agent's individual encoder, action selector, and weight generator, where each GRU cell has 64 hidden nodes.
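A minimal sketch of such an augmentation is given below, assuming the GRU cell sits between a module's input layer and its output layer; the surrounding layer sizes and the exact placement are our own assumptions.

```python
import torch
import torch.nn as nn

class RecurrentModule(nn.Module):
    """A per-agent module (encoder, action selector, or weight generator)
    with one GRU cell (64 hidden units) added to carry history across steps."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.fc_in = nn.Linear(in_dim, hidden)
        self.gru = nn.GRUCell(hidden, hidden)   # 64 hidden nodes, as in the text
        self.fc_out = nn.Linear(hidden, out_dim)

    def forward(self, x, h):
        h = self.gru(torch.relu(self.fc_in(x)), h)
        return self.fc_out(h), h                # output and updated hidden state

m = RecurrentModule(in_dim=10, out_dim=2)
h = torch.zeros(1, 64)                          # initial hidden state
y, h = m(torch.randn(1, 10), h)
```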
Figure 10 shows the result of applying the RNN. We implement IDQN with an RNN, and the results show that the average number of steps to complete the task with RNN-based IDQN is slightly smaller than with feed-forward IDQN. In this case, the RNN helps to improve performance by tackling the partial observability issue. On the other hand, SchedNet-RNN and SchedNet achieve similar performance. We think that the communication in SchedNet somewhat resolves the partial observability issue, so the impact of considering temporal correlation with an RNN is relatively small. Although applying an RNN to SchedNet is not particularly helpful in this simple environment, we expect that in a more complex environment, using the recurrent connection is more helpful.
Result in CCN Figure 11 illustrates the learning curve over 200,000 steps in CCN. In FC, since all agents can broadcast their messages during execution, they achieve the best performance. IDQN and COMA, in which no communication is allowed, take a longer time to complete the task compared to the other baselines. Their performances are similar because no cooperation can be achieved without the exchange of observations in this environment.
As expected, SchedNet and DIAL outperform IDQN and COMA. Although DIAL works well when there is no contention constraint, under the contention constraint the average number of steps to complete the task in DIAL(1) is larger than that of SchedNet-Top(1). This result shows the same tendency as the result in the PP environment.
Issues. The role of the scheduler is to consider the constraint due to accessing a shared medium, so that only k < n agents may broadcast their encoded messages. Here, k is determined by the wireless communication environment; for example, under a single wireless channel environment where each agent is located in the other agents' interference range, k = 1. Although determining the number of agents that can be simultaneously scheduled is somewhat more complex in general, we abstract it with a single number k, because the goal of this paper lies in studying the importance of considering scheduling constraints.
There are two key challenges in designing the scheduler: (i) how to schedule agents in a distributed manner for decentralized execution, and (ii) how to strike a good balance between simplicity in implementation and training, and the integrity of reflecting the current practice of MAC (Medium Access Control) protocols.
Weight-based scheduling To tackle the challenges addressed in the previous paragraph, we propose a scheduler, called the weight-based scheduler (WSA), that works based on each agent's individual weight computed from its own observation. As shown in Figure 12, the role of the WSA is to map the weight vector w = [w_i] of all n agents to the schedule profile c. This scheduling is extremely simple but, more importantly, highly amenable to the philosophy of distributed execution. The remaining checkpoint is whether this principle is capable of efficiently approximating practical wireless scheduling protocols. To this end, we consider the following two weight-based scheduling algorithms among the many different protocols that could be devised:
Top(k) can be a nice abstraction of the MaxWeight scheduling principle (Tassiulas & Ephremides, 1992) or its distributed approximation (Yi et al., 2008), in which case it is known that different choices of weight values result in achieving different performance metrics, e.g., using the amount of messages queued for transmission as the weight. Softmax(k) can be viewed as a simplified model of CSMA (Carrier Sense Multiple Access), which forms the basis of 802.11 Wi-Fi. Due to space limitations, we refer the reader to Jiang & Walrand (2010) for details. We now present how Top(k) and Softmax(k) work.
CSMA is one of the typical distributed MAC scheduling schemes in wireless communication systems. To show the feasibility of scheduling Top(k) and Softmax(k) in a distributed manner, we explain a variant of CSMA. In this section, we first present the concept of CSMA.
How does CSMA work? The key idea of CSMA is “listen before transmit”. Under a CSMA algorithm, prior to trying to transmit a packet, senders first check whether the medium is busy or idle, and then transmit the packet only when the medium is sensed as idle, i.e., no one is using the
channel. To control the aggressiveness of such medium access, each sender maintains a backoff timer, which is set to a certain value based on a pre-defined rule. The timer runs only when the medium is idle, and stops otherwise. With the backoff timer, links try to avoid collisions by the following procedure:
Each sender does not start transmission immediately when the medium is sensed idle, but keeps silent until its backoff timer expires.
After a sender grabs the channel, it holds the channel for some duration, called the holding time.
Depending on how to choose the backoff and holding times, there can be many variants of CSMA that work for various purposes such as fairness and throughput. Two examples of these, Top(k) and Softmax(k), are introduced in the following sections.
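To make the backoff/holding procedure concrete, the following toy simulation runs repeated contention rounds among senders that all hear each other; the timer distributions are placeholders, since the generic procedure above does not fix them.

```python
import numpy as np

def csma_rounds(backoff_sampler, holding_sampler, n_agents, n_rounds, seed=0):
    """Simulate repeated contention rounds on a single shared channel.
    In each round, the sender whose backoff timer expires first grabs the
    channel and holds it for its holding time; the others freeze their timers."""
    rng = np.random.default_rng(seed)
    airtime = np.zeros(n_agents)
    for _ in range(n_rounds):
        backoffs = backoff_sampler(rng, n_agents)   # "listen before transmit"
        winner = int(np.argmin(backoffs))           # earliest timer expiry wins
        airtime[winner] += holding_sampler(rng, winner)
    return airtime / airtime.sum()                  # fraction of channel time per agent

# Placeholder timer distributions (not specified by the generic procedure above).
share = csma_rounds(lambda rng, n: rng.exponential(1.0, n),
                    lambda rng, i: rng.exponential(1.0),
                    n_agents=4, n_rounds=10_000)
print(share)
```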
In this subsection, we introduce a simple distributed scheduling algorithm, called Distributed Top(k), which can work with SchedNet-Top(k). It is based on CSMA, where each sender determines its backoff and holding times as follows. In SchedNet, each agent generates its scheduling weight w based on its own observation. The agent sets its backoff time to 1 − w, where w is its scheduling weight, and it waits for this backoff time before it tries to broadcast its message. Once it successfully broadcasts the message, it immediately releases the channel. Thus, the agent with the highest w can grab the channel in a decentralized manner without any message passing. By repeating this k times, we can realize decentralized Top(k) scheduling.
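A sketch of this procedure, under the assumption that the weights lie in [0, 1] so that 1 − w is a valid backoff time; ties between equal weights and propagation delays are ignored.

```python
import numpy as np

def distributed_top_k(weights, k):
    """Distributed Top(k): in each of k contention rounds, every not-yet-scheduled
    agent waits a backoff time of (1 - w_i); the agent whose timer expires first
    broadcasts its message and then immediately releases the channel."""
    weights = np.asarray(weights, dtype=float)
    remaining = set(range(len(weights)))
    scheduled = []
    for _ in range(k):
        backoffs = {i: 1.0 - weights[i] for i in remaining}
        winner = min(backoffs, key=backoffs.get)   # shortest backoff grabs the channel
        scheduled.append(winner)
        remaining.remove(winner)
    return scheduled

print(distributed_top_k([0.74, 0.27, 0.26, 0.26], k=1))   # [0]
```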
To show the feasibility of distributed scheduling, we implemented Distributed Top(k) in the Contiki network simulator (Dunkels et al., 2004) and ran the trained agents on the PP task. In our experiment, the Top(k) agents are successfully scheduled 98% of the time, and the 2% failures are due to probabilistic collisions, in which one of the colliding agents is randomly scheduled by the default collision-avoidance mechanism implemented in Contiki. In this case, the agents achieve 98.9% of the performance compared to the case where the Top(k) agents are ideally scheduled.
In this section, we explain the relation between Softmax(k) and existing CSMA-based wireless MAC protocols, called oCSMA. When we use Softmax(k) with k = 1, the scheduling algorithm directly relates to the channel selection probability of oCSMA algorithms. First, we explain how it works and show that the resulting channel access probability has the same form as Softmax(k).
How does oCSMA work? It is also based on the basic CSMA algorithm. Once each agent generates its scheduling weight w_i, it sets b_i and h_i to satisfy w_i = log(b_i h_i). It then sets its backoff and holding times following exponential distributions with means 1/b_i and h_i, respectively. Based on these backoff and holding times, each agent runs the oCSMA algorithm. In this case, if all agents are in the communication range of one another, the probability that agent i is scheduled over time is as follows: