Many real-world reinforcement learning tasks require multiple agents to make sequential decisions under the agents' interaction, where well-coordinated actions among the agents are crucial for achieving the target goal in these tasks. One way to accelerate the coordination effect is to enable multiple agents to communicate with each other in a distributed manner and behave as a group. In this paper, we study a practical scenario in which (i) the communication bandwidth is limited and (ii) the agents share the communication medium, so that only a restricted number of agents are able to use the medium simultaneously, as in the state-of-the-art wireless networking standards. This calls for a certain form of communication scheduling. In that regard, we propose a multi-agent deep reinforcement learning framework, called SchedNet, in which agents learn how to schedule themselves, how to encode their messages, and how to select actions based on received messages. SchedNet is capable of deciding which agents should be entitled to broadcast their (encoded) messages, by learning the importance of each agent's partially observed information. We evaluate SchedNet against multiple baselines under two different applications, namely, cooperative communication and navigation, and predator-prey. Our experiments show a non-negligible performance gap, ranging from 32% to 43%, between SchedNet and other mechanisms such as those without communication or with vanilla scheduling methods, e.g., round robin.
Reinforcement Learning (RL) has garnered renewed interest in recent years. Playing Atari games (Mnih et al., 2015), robotics control (Gu et al., 2017; Lillicrap et al., 2015), and adaptive video streaming (Mao et al., 2017) constitute just a few of the vast range of RL applications. Combined with developments in deep learning, deep reinforcement learning (Deep RL) has emerged as an accelerator in related fields. From the well-known success in single-agent deep reinforcement learning, such as Mnih et al. (2015), we now witness growing interest in its multi-agent extension, multi-agent reinforcement learning (MARL), exemplified in Gupta et al. (2017); Lowe et al. (2017); Foerster et al. (2017a); Omidshafiei et al. (2017); Foerster et al. (2016); Sukhbaatar et al. (2016); Mordatch & Abbeel (2017); Havrylov & Titov (2017); Palmer et al. (2017); Peng et al. (2017); Foerster et al. (2017c); Tampuu et al. (2017); Leibo et al. (2017); Foerster et al. (2017b). In the MARL problem commonly addressed in these works, multiple agents interact in a single environment repeatedly and improve their policy iteratively by learning from observations to achieve a common goal. Of particular interest is the distinction between two lines of research: one fostering direct communication among the agents themselves, as in Foerster et al. (2016); Sukhbaatar et al. (2016), and the other coordinating their cooperative behavior without direct communication, as in Foerster et al. (2017b); Palmer et al. (2017); Leibo et al. (2017).
In this work, we concern ourselves with the former. We consider MARL scenarios wherein the task at hand is of a cooperative nature and agents are situated in a partially observable environment, but each endowed with different observation power. We formulate this scenario into a multi-agent
sequential decision-making problem, such that all agents share the goal of maximizing the same discounted sum of rewards. For the agents to communicate directly with each other and behave as a coordinated group rather than merely coexisting individuals, they must carefully determine the information they exchange under a practical bandwidth-limited environment and/or in the case of high communication cost. To coordinate this exchange of messages, we adopt the centralized training and distributed execution paradigm popularized in recent works, e.g., Foerster et al. (2017a); Lowe et al. (2017); Sunehag et al. (2018); Rashid et al. (2018); Gupta et al. (2017).
In addition to bandwidth-related constraints, we take the issue of sharing the communication medium into consideration, especially when agents communicate over wireless channels. The state-of-the-art standards on wireless communication such as Wi-Fi and LTE specify the way of scheduling users as one of their basic functions. However, as elaborated in Related work, MARL problems involving scheduling of only a restricted set of agents have not yet been extensively studied. The key challenges in this problem are: (i) that limited bandwidth implies that agents must exchange succinct information: something concise and yet meaningful, and (ii) that the shared medium means that potential contenders must be appropriately arbitrated for proper collision avoidance, necessitating a certain form of communication scheduling, popularly referred to as MAC (Medium Access Control) in the area of wireless communication. While stressing the coupled nature of the encoding/decoding and the scheduling issue, we zero in on the said communication channel-based concerns and construct our neural network accordingly.
Contributions In this paper, we propose a new deep multi-agent reinforcement learning architecture, called SchedNet, with the rationale of centralized training and distributed execution in order to better achieve a common goal via decentralized cooperation. During distributed execution, agents are allowed to communicate over wireless channels, where messages are broadcast to all agents in each agent's communication range. This broadcasting feature of wireless communication necessitates a Medium Access Control (MAC) protocol to arbitrate contending communicators in a shared medium; CSMA (Carrier Sense Multiple Access) in Wi-Fi is one such MAC protocol. While prior work on MARL to date considers only the limited bandwidth constraint, we additionally address the shared medium contention issue in what we believe is the first work of its kind: deciding which nodes are granted access to the shared medium. Intuitively, nodes with more important observations should be chosen, for which we adopt a simple yet powerful mechanism called the weight-based scheduler (WSA), designed to reconcile simplicity in training with the integrity of reflecting real-world MAC protocols in use (e.g., 802.11 Wi-Fi). We evaluate SchedNet for two applications, cooperative communication and navigation and predator/prey, and demonstrate that SchedNet outperforms other baseline mechanisms such as those without any communication or with a simple scheduling mechanism such as round robin. We comment that SchedNet is not intended to compete with other algorithms for cooperative multi-agent tasks that do not consider scheduling, but to complement them. We believe that adding our idea of agent scheduling makes those algorithms much more practical and valuable.
Related work We now discuss the body of relevant literature. Busoniu et al. (2008) and Tan (1993) have studied MARL with decentralized execution extensively. However, these are based on tabular methods, so they are restricted to simple environments. Combined with developments in deep learning, deep MARL algorithms have emerged (Tampuu et al., 2017; Foerster et al., 2017a; Lowe et al., 2017). Tampuu et al. (2017) use a combination of DQN with independent Q-learning. This independent learning does not perform well because each agent considers the others as part of the environment and ignores them. Foerster et al. (2017a); Lowe et al. (2017); Gupta et al. (2017); Sunehag et al. (2018), and Foerster et al. (2017b) adopt the framework of centralized training with decentralized execution, empowering the agents to learn cooperative behavior that accounts for other agents' policies without any communication during distributed execution.
It is widely accepted that communication can further enhance the collective intelligence of learning agents in their attempt to complete cooperative tasks. To this end, a number of papers have previously studied the learning of communication protocols and languages to use among multiple agents in reinforcement learning. We explore those bearing the closest resemblance to our research. Foerster et al. (2016); Sukhbaatar et al. (2016); Peng et al. (2017); Guestrin et al. (2002), and Zhang & Lesser (2013) train multiple agents to learn a communication protocol, and have shown that communicating agents achieve better rewards at various tasks. Mordatch & Abbeel (2017) and Havrylov & Titov (2017) investigate the possibility of the artificial emergence of language. Coordinated RL by Guestrin et al. (2002) is an earlier work demonstrating the feasibility of structured communication and the agents' selection of a jointly optimal action.
Only DIAL (Foerster et al., 2016) and Zhang & Lesser (2013) explicitly address bandwidth-related concerns. In DIAL, the communication channel of the training environment has a limited bandwidth, such that the agents being trained are urged to establish more resource-efficient communication protocols. The environment in Zhang & Lesser (2013) also has a limited-bandwidth channel in effect, due to the large amount of information exchanged in running a distributed constraint optimization algorithm. Recently, Jiang & Lu (2018) propose an attentional communication model that allows some agents, who request additional information from others, to gather observations from neighboring agents. However, they do not explicitly consider the constraints imposed by limited communication bandwidth and/or the scheduling required by communication over a shared medium.
To the best of our knowledge, there is no prior work that incorporates an intelligent scheduling entity in order to facilitate inter-agent communication in both limited-bandwidth and shared-medium-access scenarios. As outlined in the introduction, intelligent scheduling among learning agents is pivotal in the orchestration of their communication to better utilize the limited available bandwidth, as well as in the arbitration of agents contending for shared medium access.
that selects i's action based only on what is partially observed by i. The critic is naturally responsible for centralized training and thus works in a centralized manner; it is therefore allowed to take the global state s as its input, which includes all agents' observations and extra information from the environment. The role of the critic is to "criticize" each individual agent's actions. This centralized nature of the critic helps in providing more accurate feedback to the individual actors with limited observation horizons. In this case, each agent's policy, π_i, is updated by a variant of (1) as:
In practical scenarios where agents are typically separated but are able to communicate over a shared medium, e.g., a frequency channel in wireless communications, two important constraints are imposed: bandwidth and contention for medium access (Rappaport, 2001). The bandwidth constraint entails a limited number of bits per unit time, and the contention constraint requires avoiding collisions among multiple simultaneous transmissions, which arise from the broadcast nature of signals in wireless communication. Thus, only a restricted number of agents are allowed to transmit their messages at each time step for reliable message transfer. In this paper, we use a simple model in which the aggregate information size per time step is limited to L_band bits and only K_sched out of n agents may broadcast their messages.
Weight-based Scheduling Noting that distributed execution of agents is of significant importance, there may exist a variety of scheduling mechanisms for scheduling K_sched agents in a distributed manner. In this paper, we adopt a simple weight-based algorithm, which we call WSA (Weight-based Scheduling Algorithm). Once each agent decides its own weight, the agents are scheduled based on their weights following one of a class of pre-defined rules. We consider the following two specific rules among many possible proposals, chosen for their simplicity and, more importantly, because they approximate wireless scheduling protocols used in practice well.
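As a concrete illustration of the two rules, a minimal numpy sketch follows; the function names, tie-breaking behavior, and the use of a softmax over the raw weights in Softmax(k) are our own assumptions rather than an exact specification.

```python
import numpy as np

def top_k(weights, k):
    """Deterministic rule: schedule the k agents with the largest weights."""
    order = np.argsort(weights)[::-1]          # indices sorted by decreasing weight
    schedule = np.zeros(len(weights), dtype=int)
    schedule[order[:k]] = 1                    # schedule profile c: 1 = may broadcast
    return schedule

def softmax_k(weights, k, rng=np.random.default_rng()):
    """Probabilistic rule: sample k distinct agents with probability
    proportional to exp(w_i) (a softmax over the weights; assumed form)."""
    probs = np.exp(weights - np.max(weights))
    probs /= probs.sum()
    chosen = rng.choice(len(weights), size=k, replace=False, p=probs)
    schedule = np.zeros(len(weights), dtype=int)
    schedule[chosen] = 1
    return schedule

# Example: four agents, one slot (k = 1), weights as reported for the PP experiment.
w = np.array([0.74, 0.27, 0.26, 0.26])
print(top_k(w, 1))        # [1 0 0 0] -- the highest-weight agent wins
print(softmax_k(w, 1))    # random, biased towards agent 0
```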
Since distributed execution is one of our major operational constraints in SchedNet or other CTDE-based MARL algorithms, Top(k) and Softmax(k) should be realizable via a weight-based mechanism in a distributed manner. In fact, this has been an active research topic to date in wireless networking, where many algorithms exist (Tassiulas & Ephremides, 1992; Yi et al., 2008; Jiang & Walrand, 2010). Due to space limitation, we present how to obtain distributed versions of those two rules based on weights in our supplementary material. To summarize, using so-called CSMA (Carrier Sense Multiple Access) (Kurose, 2005), which is a fully distributed MAC scheduler and forms a basis of Wi-Fi, given agents' weight values, it is possible to implement Top(k) and Softmax(k).
Our goal is to train agents such that, every time the agents take actions, only K_sched agents can broadcast their messages of limited size L_band, with the aim of receiving the highest cumulative reward via cooperation. Each agent should determine a policy described by its scheduling weights, encoded communication messages, and actions.
To this end, we propose a new deep MARL framework with scheduled communications, called SchedNet, whose overall architecture is depicted in Figure 1. SchedNet consists of the following three components: (i) actor network, (ii) scheduler, and (iii) critic network. This section is devoted to presenting the architecture only, whose details are presented in the subsequent sections.
Neural networks The actor network is the collection of n per-agent individual actor networks, where each agent i's individual actor network consists of a triple of the following networks: a message encoder, a weight generator, and an action selector.
Coupling: Actor and Scheduler The encoder, the weight generator, and the scheduler are the modules that handle the constraints of limited bandwidth and shared medium access. Their common goal is to learn the state-dependent "importance" of each agent's observation: the encoders generate compressed messages, and the per-agent weight generators produce the weights that serve as the basis of the external scheduling mechanism. These three modules work together to smartly respond to time-varying states. The action selector is trained to decode the incoming messages and, consequently, to take good actions that maximize the reward. At every time step, the schedule profile c varies depending on the observation of each agent, so the incoming message m comes from a different combination of agents. Since the agents can be heterogeneous and each has its own encoder, the action selector must be able to make sense of incoming messages from different senders. Moreover, as the weight generator's policy changes, the distribution of incoming messages also changes, which is in turn shaped by the pre-defined WSA. Thus, the action selector should adjust to the changed scheduling, which in turn affects the encoder. The updates of the encoder and the action selector trigger the update of the scheduler again. Hence, weight generators, message encoders, and action selectors are strongly coupled through their dependence on a specific WSA, and we train these three networks at the same time with a common critic.
Scheduling logic The schedule profile c is determined by the WSA module, which is mathematically a mapping from all agents' weights w (generated by the per-agent weight generators) to c. Typical examples of these mappings are Top(k) and Softmax(k), as mentioned above. The scheduler of each agent is trained appropriately depending on the employed WSA algorithm.
parametrized by θ_c to estimate the state value function V_θc(s) for the action selectors and message encoders, and the action-value function Q^π(s, w) for the weight generators. The critic is used only during training, and it can use the global state s, which includes the observations of all agents. All networks in the actor are trained with gradients based on temporal difference backups. To share common features between V_θc(s) and Q^π(s, w) and perform efficient training, we use shared parameters in the lower layers of the neural network between the two functions, as shown in Figure 2.
where s and s′ are the global states corresponding to the observations at the current and next time steps. We can get the value of the state V_θc(s) from the centralized critic and then adjust the parameters θ_u via gradient ascent accordingly.
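To make the parameter sharing concrete, here is a minimal PyTorch-style sketch of a critic with two heads, V_θc(s) and Q^π(s, w), whose first two hidden layers are shared (cf. Figure 2); the layer widths and the point at which the weight vector w enters the Q head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SharedCritic(nn.Module):
    """Centralized critic with two heads on shared lower layers:
    V(s) criticizes the action selectors and encoders,
    Q(s, w) criticizes the weight generators."""
    def __init__(self, state_dim, n_agents, hidden=64):
        super().__init__()
        # First two hidden layers shared between the two value functions.
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Each function gets its own third hidden layer and output.
        self.v_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))
        # Where w enters is an assumption (here: concatenated at the Q head).
        self.q_head = nn.Sequential(nn.Linear(hidden + n_agents, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

    def forward(self, state, weights):
        h = self.shared(state)
        v = self.v_head(h)                                    # V(s)
        q = self.q_head(torch.cat([h, weights], dim=-1))      # Q(s, w)
        return v, q

# Toy usage: a global state of dimension 20 and four agents' scheduling weights.
critic = SharedCritic(state_dim=20, n_agents=4)
v, q = critic(torch.randn(1, 20), torch.rand(1, 4))
```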
In execution, each agent i should be able to determine its scheduling weight w_i, encoded message m_i, and action u_i in a distributed manner. This process must be based on its own observation, using its own action selector, message encoder, and weight generator, with the parameters θ^as_i, θ^enc_i, and θ^wg_i, respectively. After each agent determines its scheduling weight, K_sched agents are scheduled by the WSA, and the encoded messages of the scheduled agents are broadcast to all agents. Finally, each agent selects an action using the received messages. This process is repeated sequentially under different observations over time.
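The per-step execution flow described above can be summarized by the following toy sketch; the three per-agent modules are replaced by hypothetical stand-in functions, and only the data flow (weights, WSA schedule, messages of the scheduled agents, actions) mirrors the description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the trained per-agent networks.
def weight_generator(obs):            # w_i = f^wg_i(o_i)
    return float(obs.mean())
def message_encoder(obs, l=2):        # m_i = f^enc_i(o_i), an l-dimensional message
    return obs[:l]
def action_selector(obs, messages):   # u_i = f^as_i(o_i, received messages)
    return int(np.argmax(np.concatenate([obs] + messages)))

def execute_step(observations, k=1):
    # 1) Every agent computes its scheduling weight from its own observation.
    weights = np.array([weight_generator(o) for o in observations])
    # 2) The WSA maps the weights to a schedule profile c (here: Top(k)).
    scheduled = np.argsort(weights)[::-1][:k]
    # 3) Only the scheduled agents encode and broadcast their messages.
    messages = [message_encoder(observations[i]) for i in scheduled]
    # 4) Every agent selects its action from its observation and the received messages.
    return [action_selector(o, messages) for o in observations]

obs = [rng.random(5) for _ in range(4)]   # four agents with toy observations
print(execute_step(obs, k=1))
```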
Environments To evaluate SchedNet, we consider two different environments for demonstrative purposes: Predator and Prey (PP), which is used in Stone & Veloso (2000), and Cooperative Communication and Navigation (CCN), which is a simplified version of the one in Lowe et al. (2017). The detailed experimental environments are elaborated in the following subsections as well as in the supplementary material. We take the communication environment into consideration as follows: k out of all agents have the chance to broadcast a message whose bandwidth is limited by l.
Tested algorithms and setup We perform experiments in the aforementioned environments. We compare SchedNet with a variant of DIAL (Foerster et al., 2016) that allows communication with limited bandwidth. During the execution of DIAL, a limited number (k) of agents are scheduled following a simple round-robin scheduling algorithm, and each agent reuses the outdated messages of non-scheduled agents to decide which action to take; we call this variant DIAL(k). The other baselines are independent DQN (IDQN) (Tampuu et al., 2017) and COMA (Foerster et al., 2017a), in which no agent is allowed to communicate.
Figure 3: Learning curves for the PP and CCN tasks. The plots show the average time taken to complete the task, where shorter time is better for the agents.
To see the impact of scheduling in SchedNet, we compare SchedNet with (i) RR (round robin), which is a canonical scheduling method in communication systems where all agents are sequentially scheduled, and (ii) FC (full communication), which is the ideal configuration wherein all the agents can send their messages without any scheduling or bandwidth constraints. We also diversify the WSA in SchedNet into (i) Sched-Softmax(1) and (ii) Sched-Top(1), whose details are in Section 3.1. We train our models until convergence, and then evaluate them by averaging metrics over 1,000 iterations. The shaded area in each plot denotes 95% confidence intervals based on 6-10 runs with different seeds.
In this task, there are multiple agents who must capture a randomly moving prey. Each agent's observation includes its own position and the relative position of the prey, if observed. We employ four agents with different observation horizons, where only agent 1 has a 5×5 view while agents 2, 3, and 4 have a smaller, 3×3 view. The predators are rewarded when they capture the prey, and thus the performance metric is the number of time steps taken to capture the prey.
Result in PP Figure 3a illustrates the learning curve over 750,000 steps in PP. In FC, since the agents can use full state information even during execution, they achieve the best performance. SchedNet outperforms IDQN and COMA, in which communication is not allowed. We observe that agents first find the prey and then follow it until all other agents also eventually observe the prey. An agent successfully learns to follow the prey once it has observed it, but it takes a long time to encounter the prey for the first time. If an agent broadcasts a message that includes the location information of the prey, then the other agents can find the prey more quickly. Thus, it is natural that SchedNet and DIAL perform better than IDQN or COMA, because they are trained to exploit communication. However, DIAL is not trained to work under medium contention constraints. Although DIAL works well when there are no contention constraints, when only one agent is scheduled to broadcast its message by a simple scheduling algorithm (i.e., RR), the average number of steps to capture the prey in DIAL(1) is larger than that of SchedNet-Top(1), because the outdated messages of non-scheduled agents are noisy inputs for the agents deciding on actions. Thus, scheduling should be taken into account from the training stage so that the agents can work in such a demanding environment.
Impact of intelligent scheduling In Figure 3b, we observe that IDQN, RR, and SchedNet-Softmax(1) lie more or less on a comparable performance tier, with SchedNet-Softmax(1) as the best in the tier. SchedNet-Top(1) demonstrates a non-negligible improvement over this tier, implying that a deterministic selection improves the agents' collective rewards the most. In particular, SchedNet-Top(1) improves the performance by 43% compared to RR. Figure 3b lets us infer that, since all the agents are trained under the same conditions except for the scheduler, the difference in the scheduler is the sole determining factor for the variation in performance. Thus, ablating away the benefit from smart encoding, the intelligent scheduling element in SchedNet can be credited with the better performance.
Weight-based Scheduling We attempt to explain the internal behavior of SchedNet by investigating instances of temporal scheduling profiles obtained during execution. We observe that SchedNet has learned to schedule the agents with farther observation horizons, realizing the rationale of importance-based assignment of scheduling priority in the PP scenario as well. Recall that Agent 1 has a wider view and thus tends to obtain valuable observations more frequently. In Figure 4, we see that scheduling chances are distributed as (14, 3, 4, 4), where the corresponding average weights are (0.74, 0.27, 0.26, 0.26), implying that agents with greater observation power tend to be scheduled more often.
Message encoding We now attempt to understand what the predator agents communicate when performing the task. Figure 5 shows 2D projections of the messages generated by the scheduled agent under SchedNet-Top(1) with l = 2. When the agent does not observe the prey (blue circles in the figure), most of the messages reside in the bottom or the left partition of the plot. On the other hand, the messages have large variance when the agent observes the prey (red 'x'). This is because the agent should transfer more informative messages, implicitly including the location of the prey, when it observes the prey. Further analysis of the messages is presented in our supplementary material.
In this task, each agent's goal is to arrive at a pre-specified destination in its one-dimensional world, and the agents collect a joint reward when both reach their respective destinations. Each agent has a zero observation horizon around itself, but it can observe the situation of the other agent. We introduce heterogeneity into the scenario, where the agent-destination distance at the beginning of the task differs across agents. The metric used to gauge performance is the number of time steps taken to complete the CCN task.
Result in CCN We examine the CCN environment, whose results are shown in Figure 3c. SchedNet and the other baselines were trained for 200,000 steps. As expected, IDQN takes the longest time, and FC takes the shortest. RR exhibits mediocre performance, better than IDQN, because agents at least take turns obtaining the communication opportunity. Of particular interest is SchedNet, outperforming both IDQN and RR with a non-negligible gap. We remark that the deterministic selection with SchedNet-Top(1) slightly beats its probabilistic counterpart, SchedNet-Softmax(1). The 32% improvement of SchedNet over RR clearly portrays the effect of intelligent scheduling, as the carefully learned scheduling method of SchedNet completes the CCN task faster than the simplistic RR.
Scheduling in CCN As Agent 2 is farther from its destination than Agent 1, we observe that Agent 1 is scheduled more frequently to drive Agent 2 to its destination (7 vs. 18), as shown in Figure 6. This evidences that SchedNet flexibly adapts to the heterogeneity of agents via scheduling. Towards more efficient completion of the task, more scheduling opportunities should be given to more important agents. This is in accordance with the results obtained in the PP environment: more important agents are scheduled more often.
We have proposed SchedNet for learning to schedule inter-agent communications in fully-cooperative multi-agent tasks. In SchedNet, we have the centralized critic giving feedback to the actor, which consists of the message encoders, action selectors, and weight generators of each individual agent. The message encoders and action selectors are criticized towards compressing observations more efficiently and selecting actions that are more rewarding in view of the cooperative task at hand. Meanwhile, the weight generators are criticized such that the k agents with apparently more valuable observations are allowed to access the shared medium and broadcast their messages to all other agents. Empirical results and an accompanying ablation study indicate that the learnt encoding and scheduling behaviors each significantly improve the agents' performance. We have observed that intelligent, distributed communication scheduling can aid in a more efficient, coordinated, and rewarding behavior of learning agents in the MARL setting.
This work was supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00170, Virtual Presence in Moving Objects through 5G).
REFERENCES
Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.
Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, And Cybernetics, 2008.
Adam Dunkels, Bjorn Gronvall, and Thiemo Voigt. Contiki-a lightweight and flexible operating system for tiny networked sensors. In Proceedings of Local Computer Networks (LCN), 2004.
Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Proceedings of Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.
Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017a.
Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Philip Torr, Pushmeet Kohli, Shimon Whiteson, et al. Stabilising experience replay for deep multi-agent reinforcement learning. arXiv preprint arXiv:1702.08887, 2017b.
Jakob N. Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017c.
Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017.
Carlos Guestrin, Michail Lagoudakis, and R Parr. Coordinated reinforcement learning. In Proceedings of International Conference on Machine Learning, 2002.
Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, 2017.
Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: learning to communicate with sequences of symbols. In Proceedings of Advances in Neural Information Processing Systems, pp. 2146–2156, 2017.
Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length and Helmholtz free energy. In Proceedings of Advances in Neural Information Processing Systems, pp. 3–10, 1994.
Hyeryung Jang, Se-Young Yun, Jinwoo Shin, and Yung Yi. Distributed learning for utility maximization over CSMA-based wireless multihop networks. In Proceedings of IEEE INFOCOM, 2014.
Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. arXiv preprint arXiv:1805.07733, 2018.
Libin Jiang and Jean Walrand. A distributed csma algorithm for throughput and utility maximization in wireless networks. IEEE/ACM Transactions on Networking (ToN), 18(3):960–972, 2010.
Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
James F Kurose. Computer networking: A top-down approach featuring the internet. Pearson Education India, 2005.
Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of International Conference on Autonomous Agents and MultiAgent Systems, pp. 464–473, 2017.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor- critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017.
Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. Neural adaptive video streaming with pensieve. In Proceedings of ACM Sigcomm, pp. 197–210, 2017.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of International Conference on Machine Learning, pp. 1928–1937, 2016.
Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.10069, 2017.
Frans A Oliehoek, Christopher Amato, et al. A concise introduction to decentralized POMDPs, volume 1. Springer, 2016.
Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian. Deep decentralized multi-task multi-agent RL under partial observability. arXiv preprint arXiv:1703.06182, 2017.
Gregory Palmer, Karl Tuyls, Daan Bloembergen, and Rahul Savani. Lenient multi-agent deep reinforcement learning. arXiv preprint arXiv:1707.04402, 2017.
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.
Theodore Rappaport. Wireless Communications: Principles and Practice. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 2001. ISBN 0130422320.
Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of International Conference on Machine Learning, 2018.
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of International Conference on Machine Learning, 2014.
Peter Stone and Manuela Veloso. Multiagent systems: A survey from a machine learning perspec- tive. Autonomous Robots, 8(3):345–383, 2000.
Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Proceedings of Advances in Neural Information Processing Systems, pp. 2244–2252, 2016.
Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinícius Flores Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of AAMAS, 2018.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.
Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of Advances in Neural Information Processing Systems, 2000.
Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12(4):e0172395, 2017.
Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of International Conference on Machine Learning, pp. 330–337, 1993.
Leandros Tassiulas and Anthony Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE transactions on automatic control, 37(12):1936–1948, 1992.
Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
Yung Yi, Alexandre Proutière, and Mung Chiang. Complexity in wireless scheduling: Impact and tradeoffs. In Proceedings of ACM Mobihoc, 2008.
Chongjie Zhang and Victor Lesser. Coordinating multi-agent reinforcement learning with limited communication. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, 2013.
The training algorithm for SchedNet is provided in Algorithm 1. The parameters of the message
Predator and prey We assess SchedNet in the predator-prey setting of Stone & Veloso (2000), illustrated in Figure 7a. This setting involves a discretized grid world and multiple cooperating predators who must capture a randomly moving prey. Each agent's observation includes its own position and the relative position of the prey, if observed. The observation horizon of each predator is limited, thereby emphasizing the need for communication. The termination criterion for the task is that all agents observe the prey, as shown on the right of Figure 7a. The predators are rewarded when the task is terminated. We note that agents may be endowed with different observation horizons, making them heterogeneous. We employ four agents in our experiment, where only agent 1 has a 5×5 view while agents 2, 3, and 4 have a smaller, 3×3 view. The performance metric is the number of time steps taken to capture the prey.
Cooperative communication and navigation We adopt and modify the cooperative communication and navigation task in Lowe et al. (2017), where we test SchedNet in a simple one-dimensional grid as in Figure 7b. In CCN, each of the two agents resides in its own one-dimensional grid world. Each agent's goal is to arrive at a pre-specified destination (denoted by the square with a star or a heart for Agents 1 and 2, respectively), and they collect a joint reward when both agents reach their target destinations. Each agent has a zero observation horizon around itself, but it can observe the situation of the other agent. We introduce heterogeneity into the scenario, where the agent-destination distance at the beginning of the task differs across agents. In our example, Agent 2 is initially located at a farther place from its destination, as illustrated in Figure 7b. The metric used to gauge the performance of SchedNet is the number of time steps taken to complete the CCN task.
Table 1 shows the values of the hyperparameters for the CCN and PP tasks. We use the Adam optimizer to update the network parameters and soft target updates to update the target networks. The structure of the networks is the same across tasks. For the critic, we use three hidden layers, where the critic heads for the scheduler and the action selector share the first two layers. For the action selector, we use one hidden layer; for the encoder and the weight generator, three hidden layers each. The networks use rectified linear units for all hidden layers. Because the complexity of the two tasks differs, we size the hidden layers differently. The actor network and the critic network for CCN have hidden layers with 8 units and 16 units, respectively. The actor network and the critic network for PP have hidden layers with 32 units and 64 units, respectively.
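For reference, the layer structure described above can be sketched roughly as follows; the observation, message, and action dimensions are hypothetical placeholders, and reading the "actor" hidden width as applying to all three per-agent modules is our assumption.

```python
import torch.nn as nn

def mlp(sizes):
    """Stack of Linear layers with ReLU on the hidden layers only."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Hidden widths from the text: (actor, critic) = (8, 16) for CCN, (32, 64) for PP.
H_ACTOR, H_CRITIC = 32, 64                  # PP task
OBS, MSG, ACT, N_AGENTS = 10, 2, 5, 4       # hypothetical dimensions

encoder         = mlp([OBS, H_ACTOR, H_ACTOR, H_ACTOR, MSG])   # three hidden layers
weight_gen      = mlp([OBS, H_ACTOR, H_ACTOR, H_ACTOR, 1])     # three hidden layers
action_selector = mlp([OBS + N_AGENTS * MSG, H_ACTOR, ACT])    # one hidden layer
```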
Figure 8: Performance evaluation of SchedNet. The graphs show the average time taken to complete the task, where shorter time is better for the agents.
C.1 PREDATOR AND PREY
Impact of bandwidth (l) and number of schedulable agents (k) Due to communication constraints, only k agents can communicate: the scheduled agents broadcast their messages, each of which has a limited size l due to the bandwidth constraint. We examine the impact of l and k on performance in Figure 8a. As l increases, more information can be encoded into the message, which can be used by the other agents to take actions. Since the encoder and the actor are trained to maximize the shared goal of all agents, they achieve higher performance with increasing l. In Figure 8b, we compare the cases where k = 1, 2, 3, and FC, in which all agents can access the medium, with l = 1. As we can expect, the general tendency is that performance grows as k increases.
Impact of joint scheduling and encoding To study the effect of jointly coupling scheduling and encoding, we devise a comparison against a pre-trained auto-encoder (Bourlard & Kamp, 1988; Hinton & Zemel, 1994). An auto-encoder was trained ahead of time, and the encoder part of this auto-encoder was placed in the actor's ENC module in Figure 1. The encoder part is not trained further while training the other parts of the network. Henceforth, we name this modified actor "AE". Figure 8c shows the learning curve of AE and the other baselines. Table 2 highlights the impact of joint scheduling and encoding. The numbers shown are the performance metric normalized to the FC case in the PP environment. While SchedNet-Top(1) took only 2.030 times as long as FC to finish the PP task, the AE-equipped actor took 3.408 times as long as FC. This lets us ascertain that utilizing a pre-trained auto-encoder deprives the agent of the benefit of jointly training the scheduler and encoder networks in SchedNet.
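A minimal sketch of this ablation setup, assuming a standard autoencoder trained on observations with a reconstruction loss, whose encoder is then frozen and reused as the ENC module; the architecture, dimensions, and training data below are illustrative placeholders.

```python
import torch
import torch.nn as nn

OBS_DIM, MSG_DIM = 10, 2   # hypothetical observation and message sizes

# 1) Pre-train a plain autoencoder on observations (reconstruction loss).
encoder = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.ReLU(), nn.Linear(32, MSG_DIM))
decoder = nn.Sequential(nn.Linear(MSG_DIM, 32), nn.ReLU(), nn.Linear(32, OBS_DIM))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(1000):
    obs = torch.randn(64, OBS_DIM)          # placeholder batch; in practice, collected observations
    loss = ((decoder(encoder(obs)) - obs) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# 2) Freeze the encoder and plug it in as the actor's ENC module ("AE" baseline).
for p in encoder.parameters():
    p.requires_grad_(False)                 # not trained further with the rest of the actor
```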
What messages agents broadcast In Section 4.1, we attempted to understand what the predator agents communicate when performing the PP task with k = 1 and l = 2. In this section, we look into the messages in detail. Figure 9 shows the projections of the messages generated by the scheduled agent based on its own observation. In the PP task, the most important information is the location of the prey, which can be estimated from the observations of other agents. Thus, we are interested in the location information of the prey and the other agents. We classify the messages into four classes based on the quadrant in which the prey and the predator are located, and mark each class with a different color. Figure 9a shows the messages for different relative locations of the prey in the agents' observations, and Figure 9b shows the messages for different locations of the agent who sends the message. We can observe a general trend in the messages according to the class. We thus conclude that if the agents observe the prey, they encode into the message the relevant information that is helpful for estimating the location of the prey. The agents who receive this message interpret it to select their actions.
In MARL, the partial observability issue is one of the major problems, and there are two typical ways to tackle it. First, using an RNN structure to indirectly remember the history can alleviate partial observability. Another way is to use the observations of other agents through communication among them. In this paper, we focus more on the latter, because the goal of this paper is to show the importance of learning to schedule in a practical communication environment in which shared medium contention is inevitable.
Enlarging the observation through communication is somewhat orthogonal to considering temporal correlation. Thus, we can easily merge SchedNet with an RNN, which is appropriate for some partially observable environments.
We add one GRU layer into each agent's individual encoder, action selector, and weight generator, where each GRU cell has 64 hidden nodes.
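A minimal sketch of such an augmentation is given below, assuming the GRU cell sits between a module's input layer and its output layer; the surrounding layer sizes and the exact placement are our own assumptions.

```python
import torch
import torch.nn as nn

class RecurrentModule(nn.Module):
    """A per-agent module (encoder, action selector, or weight generator)
    with one GRU cell (64 hidden units) added to carry history across steps."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.fc_in = nn.Linear(in_dim, hidden)
        self.gru = nn.GRUCell(hidden, hidden)   # 64 hidden nodes, as in the text
        self.fc_out = nn.Linear(hidden, out_dim)

    def forward(self, x, h):
        h = self.gru(torch.relu(self.fc_in(x)), h)
        return self.fc_out(h), h                # output and updated hidden state

m = RecurrentModule(in_dim=10, out_dim=2)
h = torch.zeros(1, 64)                          # initial hidden state
y, h = m(torch.randn(1, 10), h)
```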
Figure 10 shows the result of applying the RNN. We implement IDQN with an RNN, and the results show that the average number of steps to complete the task with RNN-based IDQN is slightly smaller than with feed-forward IDQN. In this case, the RNN helps to improve performance by tackling the partial observability issue. On the other hand, SchedNet-RNN and SchedNet achieve similar performance. We think that the communication in SchedNet somewhat resolves the partial observability issue, so the impact of considering temporal correlation with an RNN is relatively small. Although applying an RNN to SchedNet is not particularly helpful in this simple environment, we expect that in a more complex environment, using the recurrent connection is more helpful.
Result in CCN Figure 11 illustrates the learning curve over 200,000 steps in CCN. In FC, since all agents can broadcast their messages during execution, they achieve the best performance. IDQN and COMA, in which no communication is allowed, take a longer time to complete the task compared to the other baselines. Their performances are similar because no cooperation can be achieved without the exchange of observations in this environment.
As expected, SchedNet and DIAL outperform IDQN and COMA. Although DIAL works well when there is no contention constraint, under the contention constraint the average number of steps to complete the task in DIAL(1) is larger than that of SchedNet-Top(1). This result shows the same tendency as the result in the PP environment.
Issues. The role of the scheduler is to consider the constraint due to accessing a shared medium, so that only k < n agents may broadcast their encoded messages. Here, k is determined by the wireless communication environment; for example, under a single wireless channel environment where each agent is located in the other agents' interference range, k = 1. Although determining the number of agents that can be simultaneously scheduled is somewhat more complex in general, we abstract it with a single number k, because the goal of this paper lies in studying the importance of considering scheduling constraints.
There are two key challenges in designing the scheduler: (i) how to schedule agents in a distributed manner for decentralized execution, and (ii) how to strike a good balance between simplicity in implementation and training, and the integrity of reflecting the current practice of MAC (Medium Access Control) protocols.
Weight-based scheduling To tackle the challenges addressed in the previous paragraph, we propose a scheduler, called the weight-based scheduler (WSA), that works based on each agent's individual weight computed from its own observation. As shown in Figure 12, the role of the WSA is to map the weight vector w = [w_i] of all n agents to the schedule profile c. This scheduling is extremely simple but, more importantly, highly amenable to the philosophy of distributed execution. The remaining checkpoint is whether this principle is capable of efficiently approximating practical wireless scheduling protocols. To this end, we consider the following two weight-based scheduling algorithms among the many different protocols that could be devised:
Top(k) can be a nice abstraction of the MaxWeight scheduling principle (Tassiulas & Ephremides, 1992) or its distributed approximation (Yi et al., 2008), in which case it is known that different choices of weight values result in achieving different performance metrics, e.g., using the amount of messages queued for transmission as the weight. Softmax(k) can be viewed as a simplified model of CSMA (Carrier Sense Multiple Access), which forms the basis of 802.11 Wi-Fi. Due to space limitations, we refer the reader to Jiang & Walrand (2010) for details. We now present how Top(k) and Softmax(k) work.
CSMA is one of the typical distributed MAC scheduling schemes in wireless communication systems. To show the feasibility of scheduling Top(k) and Softmax(k) in a distributed manner, we explain a variant of CSMA. In this section, we first present the concept of CSMA.
How does CSMA work? The key idea of CSMA is “listen before transmit”. Under a CSMA algorithm, prior to trying to transmit a packet, senders first check whether the medium is busy or idle, and then transmit the packet only when the medium is sensed as idle, i.e., no one is using the
channel. To control the aggressiveness of such medium access, each sender maintains a backoff timer, which is set to a certain value based on a pre-defined rule. The timer runs only when the medium is idle, and stops otherwise. With the backoff timer, links try to avoid collisions by the following procedure:
Each sender does not start transmission immediately when the medium is sensed idle, but keeps silent until its backoff timer expires.
After a sender grabs the channel, it holds the channel for some duration, called the holding time.
Depending on how to choose the backoff and holding times, there can be many variants of CSMA that work for various purposes such as fairness and throughput. Two examples of these, Top(k) and Softmax(k), are introduced in the following sections.
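To make the backoff/holding procedure concrete, the following toy simulation runs repeated contention rounds among senders that all hear each other; the timer distributions are placeholders, since the generic procedure above does not fix them.

```python
import numpy as np

def csma_rounds(backoff_sampler, holding_sampler, n_agents, n_rounds, seed=0):
    """Simulate repeated contention rounds on a single shared channel.
    In each round, the sender whose backoff timer expires first grabs the
    channel and holds it for its holding time; the others freeze their timers."""
    rng = np.random.default_rng(seed)
    airtime = np.zeros(n_agents)
    for _ in range(n_rounds):
        backoffs = backoff_sampler(rng, n_agents)   # "listen before transmit"
        winner = int(np.argmin(backoffs))           # earliest timer expiry wins
        airtime[winner] += holding_sampler(rng, winner)
    return airtime / airtime.sum()                  # fraction of channel time per agent

# Placeholder timer distributions (not specified by the generic procedure above).
share = csma_rounds(lambda rng, n: rng.exponential(1.0, n),
                    lambda rng, i: rng.exponential(1.0),
                    n_agents=4, n_rounds=10_000)
print(share)
```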
In this subsection, we introduce a simple distributed scheduling algorithm, called Distributed Top(k), which can work with SchedNet-Top(k). It is based on CSMA, where each sender determines its backoff and holding times as follows. In SchedNet, each agent generates its scheduling weight w based on its own observation. The agent sets its backoff time to 1 − w, where w is its scheduling weight, and it waits for this backoff time before it tries to broadcast its message. Once it successfully broadcasts the message, it immediately releases the channel. Thus, the agent with the highest w can grab the channel in a decentralized manner without any message passing. By repeating this k times, we can realize decentralized Top(k) scheduling.
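A sketch of this procedure, under the assumption that the weights lie in [0, 1] so that 1 − w is a valid backoff time; ties between equal weights and propagation delays are ignored.

```python
import numpy as np

def distributed_top_k(weights, k):
    """Distributed Top(k): in each of k contention rounds, every not-yet-scheduled
    agent waits a backoff time of (1 - w_i); the agent whose timer expires first
    broadcasts its message and then immediately releases the channel."""
    weights = np.asarray(weights, dtype=float)
    remaining = set(range(len(weights)))
    scheduled = []
    for _ in range(k):
        backoffs = {i: 1.0 - weights[i] for i in remaining}
        winner = min(backoffs, key=backoffs.get)   # shortest backoff grabs the channel
        scheduled.append(winner)
        remaining.remove(winner)
    return scheduled

print(distributed_top_k([0.74, 0.27, 0.26, 0.26], k=1))   # [0]
```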
To show the feasibility of distributed scheduling, we implemented Distributed Top(k) in the Contiki network simulator (Dunkels et al., 2004) and ran the trained agents on the PP task. In our experiment, the Top(k) agents are successfully scheduled 98% of the time, and the 2% failures are due to probabilistic collisions, in which one of the colliding agents is randomly scheduled by the default collision-avoidance mechanism implemented in Contiki. In this case, the agents achieve 98.9% of the performance compared to the case where the Top(k) agents are ideally scheduled.
In this section, we explain the relation between Softmax(k) and existing CSMA-based wireless MAC protocols, called oCSMA. When we use Softmax(k) with k = 1, the scheduling algorithm directly relates to the channel selection probability of oCSMA algorithms. First, we explain how it works and show that the resulting channel access probability has the same form as Softmax(k).
How does oCSMA work? It is also based on the basic CSMA algorithm. Once each agent generates its scheduling weight w_i, it sets b_i and h_i to satisfy w_i = log(b_i h_i). It then sets its backoff and holding times following exponential distributions with means 1/b_i and h_i, respectively. Based on these backoff and holding times, each agent runs the oCSMA algorithm. In this case, if all agents are in the communication range of one another, the probability that agent i is scheduled over time is as follows: