Contents
Graph embedding is a method for generating unsupervised node features from a graph; the resulting features can then be used in a variety of machine-learning tasks. Modern graphs, especially in industrial applications, often contain billions of nodes and trillions of edges, which is beyond the capacity of existing embedding systems. Facebook has open-sourced an embedding system, PyTorch-BigGraph (PBG), which makes several modifications to traditional multi-relation embedding systems so that it scales to graphs with billions of nodes and trillions of edges.
This series is a translation of the official PyTorch-BigGraph manual, intended to help readers get started quickly with GNNs and their use. There are fifteen articles in total; if you spot any errors, please get in touch.
(1) Facebook open-sources a graph neural network: PyTorch-BigGraph
(2) Facebook: BigGraph Chinese Documentation — Data Model (PyTorch)
(3) Facebook: BigGraph Chinese Documentation — From Entity Embeddings to Edge Scores (PyTorch)
(4) Facebook: BigGraph Chinese Documentation — I/O Format (PyTorch)
(5) Facebook: BigGraph Chinese Documentation — Batch Preparation
Distributed Mode
Source: https://torchbiggraph.readthedocs.io/en/latest/distributed_training.html
PBG can perform training across multiple machines which communicate over a network, in order to reduce training time on large graphs. Distributed training is able to concurrently utilize larger computing resources, as well as to keep the entire model stored in memory across all machines, avoiding the need to swap it to disk. On each machine, the training is further parallelized across multiple subprocesses.
Setup 启动
In order to perform distributed training, the configuration file must first be updated to contain the specification of the desired distributed setup. If training should be carried out on N machines, then the num_machines key in the config must be set to that value. In addition, the distributed_init_method must describe a way for the trainers to discover each other and set up their communication. All valid values for the init_method argument of torch.distributed.init_process_group() are accepted here. Usually this will be a path to a shared network filesystem or the network address of one of the machines. See the PyTorch docs for more information and a complete reference.
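For illustration, here is a minimal sketch of the distributed-related entries in a config.py, following the get_torchbiggraph_config convention of PBG's example configs. The entity, relation, and path values are placeholders, not a complete runnable configuration:

```python
def get_torchbiggraph_config():
    # Sketch only: paths and graph definitions below are placeholders.
    config = dict(
        # I/O paths (the checkpoint path should be on a shared filesystem
        # if no partition servers are used; see later sections)
        entity_path="data/example",
        edge_paths=["data/example/edges_partitioned"],
        checkpoint_path="model/example",
        # One entity type split into 8 partitions (assumed for this example)
        entities={"all": {"num_partitions": 8}},
        relations=[
            {"name": "follows", "lhs": "all", "rhs": "all",
             "operator": "translation"},
        ],
        dimension=100,
        # Distributed setup: 4 machines, i.e. half the number of partitions
        num_machines=4,
        # Any init_method accepted by torch.distributed.init_process_group(),
        # e.g. a shared file or the TCP address of one of the machines:
        distributed_init_method="tcp://192.168.0.1:30050",
    )
    return config
```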
To launch distributed training, call torchbiggraph_train --rank rank config.py on each machine, with rank replaced by an integer between 0 and N−1 (inclusive), different for each machine. Each machine must have PBG installed and have a copy of the config file.
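For example, with num_machines set to 4, the first machine runs torchbiggraph_train --rank 0 config.py, the second runs torchbiggraph_train --rank 1 config.py, and so on through --rank 3; the commands differ only in their rank.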
In some uncommon circumstances, one may want to store the embeddings on different machines than the ones that are performing training. In that case, one would set num_partition_servers to a positive value and manually launch some instances of torchbiggraph_partitionserver as well. See below for more information on this.
Tip:
A good default setting is to set num_machines to half the number of partitions (see below why) and leave num_partition_servers unset.
Once all these commands are started, no more manual intervention is needed.
Warning:
Unpartitioned entity types should not be used with distributed training. While the embeddings of partitioned entity types are only in use on one machine at a time and are swapped between machines as needed, the embeddings of unpartitioned entity types are communicated asynchronously through a poorly-optimized parameter server which was designed for sharing relation parameters, which are small. It cannot support synchronizing large amounts of parameters, e.g. an unpartitioned entity type with more than 1000 entities. In that case, the quality of the unpartitioned embeddings will likely be very poor.
Communication protocols
Distributed training requires the machines to coordinate and communicate in various ways for different purposes. These tasks are:
1) synchronizing which trainer is operating on which bucket, assigning them so that there are no conflicts
2) passing the embeddings of an entity partition from one trainer to the next one when needed (as this is data that is only accessed by one trainer at a time)
3) sharing parameters that all trainers need access to simultaneously, by collecting and redistributing the updates to them.
Each of these is implemented by a separate “protocol”, and each trainer takes part in some or all of them by launching subprocesses that act as clients or servers for the different protocols. These protocols are explained below to provide insight into the system.
Synchronizing bucket access
PBG parallelizes training across multiple machines by having them all operate simultaneously on disjoint buckets (i.e., buckets that don't have any partition in common). Therefore, each partition is in use by up to one machine at a time, and each machine uses up to two partitions (the only exception is for buckets "on the diagonal", that have the same left- and right-hand side partition). This means that the number of buckets one can simultaneously train on is about half the total number of partitions. For example, with 8 partitions, the disjoint buckets (0, 1), (2, 3), (4, 5) and (6, 7) can be trained on at the same time, i.e. 4 concurrent buckets.
The way the machines agree on which one gets to operate on what bucket is through a “lock server”. The server is implicitly started by the trainer of rank 0. All other machines connect to it as clients, ask for a new bucket to operate on (when they need one), get a bucket assigned from the server (or none, if all buckets have already been trained on or are “locked” because their partitions are in use by another trainer), train on it, then report it as done and repeat. The lock server tries to optimize I/O by preferring, when a trainer asks for a bucket, to assign one that has as many partitions in common with the previous bucket that the trainer trained on, so that these partitions can be kept in memory rather than having to be unloaded and reloaded.
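As an illustration of this policy, here is a toy sketch (not PBG's actual lock server): it hands out buckets whose partitions are all free, preferring one that overlaps the partitions the trainer already holds in memory.

```python
from typing import Optional, Set, Tuple

Bucket = Tuple[int, int]  # (left-hand-side partition, right-hand-side partition)

class LockServerSketch:
    """Toy model of the lock-server policy described above."""

    def __init__(self, buckets: Set[Bucket]):
        self.pending: Set[Bucket] = set(buckets)  # buckets not yet trained on
        self.locked: Set[int] = set()             # partitions in use by a trainer

    def acquire(self, previous: Optional[Bucket]) -> Optional[Bucket]:
        # A bucket can be assigned only if none of its partitions is locked.
        free = [b for b in self.pending if not (set(b) & self.locked)]
        if not free:
            return None  # all remaining buckets are done or locked
        held = set(previous) if previous is not None else set()
        # Prefer a bucket sharing partitions with the previous one, so the
        # trainer can keep those embeddings in memory.
        bucket = max(free, key=lambda b: len(set(b) & held))
        self.pending.remove(bucket)
        self.locked |= set(bucket)
        return bucket

    def release(self, bucket: Bucket) -> None:
        # The trainer reports the bucket as done; its partitions become free.
        self.locked -= set(bucket)
```

For instance, a trainer that just finished bucket (0, 1) would preferentially be assigned (0, 2) or (3, 1) over (4, 5), since the former keep one partition in memory.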
Exchanging partition embeddings
When a trainer starts operating on a bucket it needs access to the embeddings of all entities (of all types) that belong to either the left- or the right-hand side partition of the bucket. The “locking” mechanism of the lock server ensures that at most one trainer is operating on a partition at any given time. This doesn’t hold for unpartitioned entity types, which are shared among all trainers; see below. Thus each trainer has exclusive hold of the partitions it’s training on.
Once a trainer starts working on a new bucket it needs to acquire the embeddings of its partitions, and once it’s done it needs to release them and make them available, in their updated version, to the next trainer that needs them. In order to do this, there’s a system of so-called “partition servers” that store the embeddings, provide them upon request to the trainers who need them, receive back the updated embedding and store it.
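Conceptually, a partition server behaves like a check-out/check-in store for embedding matrices. The sketch below is a toy in-memory stand-in (all names are hypothetical; the real server transfers data over the network):

```python
import numpy as np

class PartitionServerSketch:
    """Toy stand-in for a partition server: it keeps the latest embedding
    matrix of each partition between trainers."""

    def __init__(self) -> None:
        self._store: "dict[int, np.ndarray]" = {}

    def checkout(self, partition: int) -> np.ndarray:
        # A trainer acquires the embeddings when it starts on a bucket
        # (in PBG this is a network transfer, not a dict lookup).
        return self._store.pop(partition)

    def checkin(self, partition: int, embeddings: np.ndarray) -> None:
        # The trainer hands the updated embeddings back once the bucket is
        # done, making them available to the next trainer that needs them.
        self._store[partition] = embeddings
```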
This service is optional, and is disabled when num_partition_servers is set to zero. In that case the trainers “send” each other the embeddings simply by writing them to the checkpoint directory (which should reside on a shared disk) and then fetching them back from there.
When this system is enabled, it can operate in two modes. The simplest mode is triggered when num_partition_servers is -1 (the default): in that case all trainers spawn a local process that acts as a partition server. If, on the other hand, num_partition_servers is a positive value then the trainers will not spawn any process, but will instead connect to the partition servers that the user must have provisioned manually by launching the torchbiggraph_partitionserver command on the appropriate number of machines.
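To summarize, a single config key selects the mode (values as described above; shown here as config-file excerpts):

```python
# Excerpt from a PBG config: num_partition_servers selects the mode.

num_partition_servers = 0    # disabled: embeddings are exchanged through
                             # the shared checkpoint directory
num_partition_servers = -1   # default: each trainer spawns a local
                             # partition server subprocess
num_partition_servers = 2    # manual: trainers connect to 2 instances of
                             # torchbiggraph_partitionserver launched by the user
```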
Updating shared parameters
Some parameters of the model need to be used by all trainers at the same time (this includes the operator weights, the global embeddings of each entity type, the embeddings of the unpartitioned entities). These are parameters that don’t depend on what bucket the trainer is operating on, and therefore are always present on all trainers (as opposed to the entity embeddings, which are loaded and unloaded as needed). These parameters are synchronized using a series of “parameter servers”. Each trainer starts a local parameter server (in a separate subprocess) and connects to all other parameter servers. Each parameter that is shared between trainers is then stored in a parameter server (possibly sharded across several of them, if too large). Each trainer also has a loop (also in a separate subprocess) which, at regular intervals, goes over each shared parameter, computes the difference between its current local value and the value it had when it was last synced with the server where the parameter is hosted and sends that delta to that server. The server, in turn, accumulates all the deltas it receives from all trainers, updates the value of the parameter and sends this new value back to the trainers. The parameter server performs throttling to 100 updates/s or 1GB/s, in order to prevent the parameter server from starving the other communication.
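The delta-synchronization loop can be pictured with the following toy sketch (hypothetical classes; PBG's real parameter servers communicate over the network and apply the throttling mentioned above):

```python
import numpy as np

class ParameterServerSketch:
    """Toy model of one shared parameter hosted on a parameter server."""

    def __init__(self, initial: np.ndarray) -> None:
        self.value = initial.copy()

    def update(self, delta: np.ndarray) -> np.ndarray:
        # Accumulate the trainer's delta and return the new global value
        # (the real server also throttles to ~100 updates/s or 1 GB/s).
        self.value += delta
        return self.value.copy()


def sync_once(server: ParameterServerSketch,
              local: np.ndarray,
              last_synced: np.ndarray) -> None:
    """One iteration of a trainer's background sync loop."""
    delta = local - last_synced       # what changed locally since the last sync
    new_value = server.update(delta)  # server folds in all trainers' deltas
    local[:] = new_value              # adopt the combined global value
    last_synced[:] = new_value
```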
Todo:
Mention distributed_tree_init_order.