dgl distributed training user guide
A DistGraphServer is set up on top of DGL's own server infrastructure; underneath, an RPC protocol fetches graph data from other machines to form a subgraph, i.e., a mini-batch. The machines that receive the data then execute in parallel in a data-parallel fashion; the executing processes are the clients, launched via torch.distributed.launch.
Distributed DGL has three main roles: servers, samplers, and trainers.
Server processes run on each machine that stores a graph partition (this includes the graph structure and node/edge features). These servers work together to serve the graph data to trainers. Note that one machine may run multiple server processes simultaneously to parallelize computation as well as network communication. In other words, servers are responsible not only for the data but also for the communication.
Sampler processes interact with the servers and sample nodes and edges to generate mini-batches for training.
Trainers contain multiple classes to interact with servers. They use DistGraph to access the partitioned graph data, DistEmbedding and DistTensor to access the node/edge features and embeddings, and DistDataLoader to interact with samplers and get mini-batches.
Note the distributed APIs: DistEmbedding, DistGraph, DistTensor, and DistDataLoader.
This API builds connections with DGL servers and creates sampler processes.
dgl.distributed.initialize() plays a role similar to torch.distributed.init_process_group(): it sets up many communication-related variables that the APIs below depend on. But it also differs in many respects, especially in how the server and client sides are defined.
def initialize(ip_config, num_servers=1, num_workers=0,
max_queue_size=MAX_QUEUE_SIZE, net_type='socket',
num_worker_threads=1):
"""Initialize DGL's distributed module
This function initializes DGL's distributed module. It acts differently in server
or client modes. In the server mode, it runs the server code and never returns.
In the client mode, it builds connections with servers for communication and
creates worker processes for distributed sampling. `num_workers` specifies
the number of sampling worker processes per trainer process.
Users also have to provide the number of server processes on each machine in order
to connect to all the server processes in the cluster of machines correctly.
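A minimal trainer-side sketch of how these pieces are typically strung together (the graph name 'reddit' and 'ip_config.txt' are placeholders matching the commands shown later):

import dgl
import torch as th

dgl.distributed.initialize('ip_config.txt')        # connect to the servers, spawn sampler workers
th.distributed.init_process_group(backend='gloo')  # the usual PyTorch DDP setup
g = dgl.distributed.DistGraph('reddit')             # handle to the partitioned graph
print(g.number_of_nodes())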
About sockets
Definition:
A socket is one endpoint of a two-way communication link between two programs running on the network. A socket is bound to a port number so that the TCP layer can identify the application that data is destined to be sent to.
Since dgl.distributed.initialize() currently only supports net_type='socket', it is worth understanding the socket communication model; see "What is a socket?".
First, the server must listen on a specific IP and port.
Then the client, which must know the server's IP and port, issues a connection request to rendezvous with the server on its socket (that IP and port). The client also needs to identify itself to the server, so it binds to a local port number that it will use during this connection; this is usually assigned by the system.
If everything goes well, the server accepts the connection. Upon acceptance, the server gets a new socket bound to the same local port, with its remote endpoint set to the client's address and port. It needs a new socket so that it can keep listening on the original socket for connection requests while serving the connected client. So after accepting, the server holds two sockets: the original listening socket plus one new socket per accepted connection?
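A toy, single-process version of that listen/connect/accept flow (port 30050 just echoes DGL's default; any free port works):

import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 30050))      # server: bind a known IP and port
srv.listen()                        # server: listen for connection requests

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(('127.0.0.1', 30050))   # client: its local port is assigned by the OS

conn, addr = srv.accept()           # server: a NEW socket dedicated to this client;
                                    # `srv` itself keeps listening for more connections
cli.sendall(b'hello')
print(conn.recv(5))                 # b'hello'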
For the server-side ssh command, the initialize function creates a DistGraphServer, which is a subclass of KVServer. KVServer is essentially a minimal key-value store that communicates through key-value mappings.
Each server process is classified by SERVER_ID as either the main server or a backup server. A machine may run several server processes, but only the main server loads the graph partition itself; the backup servers only load the partition book. DGL has two kinds of partition book, and which kind a backup server loads appears, judging from load_partition_book, to depend on the partition's .json file. For the graph partition book see ----------------> the DGL docs; the two main kinds are BasicGraphPartitionBook and RangePartitionBook.
For the main client, the partitioned graph is loaded and copied into shared memory. What exactly this shared memory refers to is not clear to me yet; what I can see is that a new partition book is created, and some partition attributes (ndata['inner_node'], EID, inner_edge, NID, etc., where inner_* is a boolean mask and NID should be the ID in the original graph) are copied into this shared-memory partition book. The conversion goes through DLPack and DGL's own NDArray library. The point seems to be telling each machine which data it holds.
On the client side, initialize defines a global SAMPLE_POOL, an instance of DGL's own CustomPool class. The number of processes is set from num_workers (= the DGL_NUM_SAMPLER parameter); each process runs init_process and creates num_worker*4 queues, and calling the pool's submit_task puts the corresponding commands onto a queue. submit_task interacts with DistDataLoader when the next batch is fetched, because DistDataLoader first fetches the batch's node IDs and then uses collate_fn (i.e., the sampler function) to fetch the node features.
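A hedged sketch of that interaction, modeled on DGL's distributed GraphSAGE example (sample_blocks is my placeholder collate_fn; g is assumed to be a DistGraph and train_nid a tensor of training node IDs):

import dgl
import torch as th

def sample_blocks(seeds):
    # The loader hands over a batch of node IDs first ...
    seeds = th.LongTensor(seeds)
    # ... then the collate_fn (the "sampler") fetches their neighborhood.
    frontier = dgl.distributed.sample_neighbors(g, seeds, fanout=10)
    return dgl.to_block(frontier, seeds)

dataloader = dgl.distributed.DistDataLoader(
    dataset=train_nid.numpy(), batch_size=1000,
    collate_fn=sample_blocks, shuffle=True)

for block in dataloader:
    pass  # feature lookup and training happen here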
The KVServer is set up via RPC, but all the rpc_servers sit on the machine with the first IP; if there are multiple servers, that machine simply opens multiple ports to act as servers.
In the .json file, node_map and edge_map show how the IDs are distributed after partitioning, but these numbers do not match the number of inner_nodes (the nodes that actually belong to that partition; see --------------------> the partition section of the distributed docs) and edges reported after load_partition.
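A sketch for checking this on one partition (the .json path mirrors the commands below; load_partition has grown extra return values across DGL versions, hence the slice):

import dgl

part = dgl.distributed.load_partition('2part_data/reddit.json', 0)
g, node_feats, edge_feats, gpb = part[:4]

print(g.number_of_nodes())               # every node stored in this partition, HALO nodes included
print(int(g.ndata['inner_node'].sum()))  # nodes actually owned by this partition
print(g.ndata[dgl.NID][:5])              # their IDs in the original, un-partitioned graph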
See the DLPack repository on GitHub.
Both PyTorch and DGL constantly convert back and forth between DLPack capsules and tensors; DLPack exists precisely so tensors can be shared across frameworks.
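A minimal round trip with PyTorch's own DLPack helpers (DGL does the analogous conversion with its NDArray under the hood):

import torch
from torch.utils import dlpack

t = torch.arange(6).reshape(2, 3)
capsule = dlpack.to_dlpack(t)       # export: the capsule shares t's memory
t2 = dlpack.from_dlpack(capsule)    # import: zero-copy view of the same buffer
t2[0, 0] = 42
print(t[0, 0])                      # tensor(42) -- same underlying storage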
Each machine is responsible for one and only one partition. It loads the partition data (the graph structure and the node data and edge data in the partition) and makes it accessible to all trainers in the cluster.
Note that DistGraph has both a standalone mode and a distributed mode.
Standalone mode: meant for testing and development; you can try the standalone DistGraph first.
Distributed mode: DistGraph connects with the servers in the cluster of machines and accesses them through the network. So the graph data really is transferred between machines over the network.
Commonly used statistics functions, on the DistGraph class.
A very important attribute: it creates a NodeDataView, and NodeDataView in turn creates a DistTensor inside get_data.
Currently, DGL does not provide protection for concurrent writes from multiple trainers when a machine runs multiple servers. This may result in data corruption. One way to avoid concurrent writes to the same row of data is to run one server process on a machine.
How do you run only one server process per machine?
Internally, distributed embeddings are built on top of distributed tensors and thus have very similar behaviors. For example, when embeddings are created, they are sharded and stored across all machines in the cluster. An embedding can be uniquely identified by a name.
If the embeddings are shared across all machines, how is the communication implemented? Is a DistTensor shared in the same way?
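A hedged sketch of the two APIs (assumes g is an initialized DistGraph; 'my_feat' and 'my_emb' are placeholder names, and the sparse optimizer has moved between dgl.distributed and dgl.distributed.optim across versions):

import torch as th
import dgl

feat = dgl.distributed.DistTensor((g.number_of_nodes(), 10), th.float32, name='my_feat')
g.ndata['my_feat'] = feat                         # attach the shared tensor as a node feature

emb = dgl.distributed.DistEmbedding(g.number_of_nodes(), 10, name='my_emb')
optimizer = dgl.distributed.optim.SparseAdagrad([emb], lr=0.05)
# in the training loop: h = emb(batch_node_ids); loss.backward(); optimizer.step()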
There are two API levels, but at either level you have to write your own code that defines how to sample.
dgl.sampling.sample_neighbors()
For the lower-level sampling API, it provides sample_neighbors() for distributed neighborhood sampling on DistGraph.
So DistGraph holds the whole (partitioned) graph, while distributed sampling extracts mini-batches from it.
The classic loaders for this: NodeDataLoader and EdgeDataLoader.
On heterogeneous graphs:
Below is an example adjacency matrix of a heterogeneous graph showing the homogeneous ID assignment. Here, the graph has two types of nodes (T0 and T1), and four types of edges (R0, R1, R2, R3). There are a total of 400 nodes in the graph and each type has 200 nodes. Nodes of T0 have IDs in [0, 200), while nodes of T1 have IDs in [200, 400). In this example, if we use a tuple to identify the nodes, nodes of T0 are identified as (T0, type-wise ID), where type-wise ID falls in [0, 200); nodes of T1 are identified as (T1, type-wise ID), where type-wise ID also falls in [0, 200).
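The same ID assignment in code form (offsets taken directly from the example above):

type_offsets = {'T0': 0, 'T1': 200}     # T0 -> [0, 200), T1 -> [200, 400)

def to_homogeneous_id(ntype, type_wise_id):
    return type_offsets[ntype] + type_wise_id

def to_type_wise_id(homo_id):
    ntype = 'T0' if homo_id < 200 else 'T1'
    return ntype, homo_id - type_offsets[ntype]

print(to_homogeneous_id('T1', 5))   # 205
print(to_type_wise_id(205))         # ('T1', 5)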
Copy the partitioned graph and the training script to the designated machines (via ip_config).
This requires passwordless SSH access between every pair of machines.
reshuffle=False
With reshuffle=False, node IDs and edge IDs of a partition do not fall into contiguous ranges; see my blog post for details.
The default port is 30050; if the port is already occupied, kill the processes that hold it.
In my setup there are two machines. launch.py creates two servers and two clients, all started via ssh. The two servers live on the two machines; on the client side, the first IP in ip_config.txt is the master and PyTorch DDP is invoked, i.e., the first IP becomes MASTER_ADDR. For how threading.Thread(), start() and join() work, see -------------------> "Python Thread.join() explained".
Command for the first server:
‘cd /home/user/gnn-tutorial/graphsage/experimental; (export PATH=$PATH:/home/user/anaconda3/bin; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=0; /home/user/anaconda3/envs/torch/bin/python train_dist_noprof.py --graph_name reddit --ip_config ip_config.txt --num_gpus 1 --local_rank 0 --num_epochs 3 --batch_size 1000))’
Command for the second server:
‘cd /home/user/gnn-tutorial/graphsage/experimental; (export PATH=$PATH:/home/user/anaconda3/bin; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=1; /home/user/anaconda3/envs/torch/bin/python train_dist_noprof.py --graph_name reddit --ip_config ip_config.txt --num_gpus 1 --local_rank 0 --num_epochs 3 --batch_size 1000))’
Command for the first client:
‘cd /home/user/gnn-tutorial/graphsage/experimental; (export PATH=$PATH:/home/user/anaconda3/bin; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=10 ; /home/user/anaconda3/envs/torch/bin/python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=192.168.1.7 --master_port=1234 train_dist_noprof.py --graph_name reddit --ip_config ip_config.txt --num_gpus 1 --local_rank 0 --num_epochs 3 --batch_size 1000))’
Command for the second client:
‘cd /home/user/gnn-tutorial/graphsage/experimental; (export PATH=$PATH:/home/user/anaconda3/bin; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=10 ; /home/user/anaconda3/envs/torch/bin/python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=192.168.1.7 --master_port=1234 train_dist_noprof.py --graph_name reddit --ip_config ip_config.txt --num_gpus 1 --local_rank 0 --num_epochs 3 --batch_size 1000))’
Next, try the DDP test again.
Same as before, refer to -----------------> PyTorch DDP.
What os.environ does.
What export does.
Why exporting MASTER_ADDR did not take effect.
Why the value stays the same after exporting it once.
Before communication is established, the code blocks at init_process_group.
Which of init_process_group's arguments have to be supplied by the environment?
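With the default init_method='env://', the answer is the four variables below, which torch.distributed.launch exports for each process (the IP/port values here simply echo the client commands above):

import os
import torch.distributed as dist

os.environ.setdefault('MASTER_ADDR', '192.168.1.7')  # first IP in ip_config.txt
os.environ.setdefault('MASTER_PORT', '1234')
os.environ.setdefault('WORLD_SIZE', '2')
os.environ.setdefault('RANK', '0')

dist.init_process_group(backend='gloo')  # blocks here until all ranks have joined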
There is also an issue about the PyTorch DDP launch script -------------> "alternative api for torch.distributed.launch"; it is worth following the discussions on both PyTorch and DGL.
The single-machine DGL version mainly follows the GraphSAGE example.
The frontier function mainly calls the sampler, because in DGL the neighborhood of the seed nodes is described as a frontier.
In the single-machine version, either in_subgraph or sampling.sample_neighbors is used, depending on whether fanouts are given, as sketched below.
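A sketch of that branch (the fanout handling mirrors the GraphSAGE-style sampler, where a fanout of None is assumed to mean "take every in-neighbor"):

import dgl

def build_frontier(g, seed_nodes, fanout):
    if fanout is None:
        # no fanout given: take the full in-neighborhood of the seeds
        return dgl.in_subgraph(g, seed_nodes)
    # otherwise sample `fanout` in-neighbors per seed node
    return dgl.sampling.sample_neighbors(g, seed_nodes, fanout)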
sample_neighbors is considerably more powerful than in_subgraph.
The examples given in the source docstring are also very clear:
Examples
--------
Assume that you have the following graph
>>> g = dgl.graph(([0, 0, 1, 1, 2, 2], [1, 2, 0, 1, 2, 0]))
And the weights
>>> g.edata['prob'] = torch.FloatTensor([0., 1., 0., 1., 0., 1.])
To sample one inbound edge for node 0 and node 1:
>>> sg = dgl.sampling.sample_neighbors(g, [0, 1], 1)
>>> sg.edges(order='eid')
(tensor([1, 0]), tensor([0, 1]))
>>> sg.edata[dgl.EID]
tensor([2, 0])
To sample one inbound edge for node 0 and node 1 with probability in edge feature
``prob``:
>>> sg = dgl.sampling.sample_neighbors(g, [0, 1], 1, prob='prob')
>>> sg.edges(order='eid')
(tensor([2, 1]), tensor([0, 1]))
With ``fanout`` greater than the number of actual neighbors and without replacement,
DGL will take all neighbors instead:
>>> sg = dgl.sampling.sample_neighbors(g, [0, 1], 3)
>>> sg.edges(order='eid')
(tensor([1, 2, 0, 1]), tensor([0, 0, 1, 1]))
"""
You can see that dgl.EID holds the edges' IDs in the original graph, and by default this kind of sampling also copies the node and edge features of the subgraph over from the original graph.
The most interesting thing about this loader is that it supports distributed training with PyTorch DDP: just set use_ddp=True and it calls the relevant DDP APIs directly, and set_epoch gives each replica of the dataset a different ordering in each epoch.
If you are using PyTorch's distributed training (e.g. when using
:mod:`torch.nn.parallel.DistributedDataParallel`), you can train the model by turning
on the `use_ddp` option:
>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> dataloader = dgl.dataloading.NodeDataLoader(
... g, train_nid, sampler, use_ddp=True,
... batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for epoch in range(start_epoch, n_epochs):
... dataloader.set_epoch(epoch)
... for input_nodes, output_nodes, blocks in dataloader:
... train_on(input_nodes, output_nodes, blocks)
On the settings for num_workers and devices:
**Tips for selecting the proper device**
* If the input graph :attr:`g` is on GPU, the output device :attr:`device` must be the same GPU
and :attr:`num_workers` must be zero. In this case, the sampling and subgraph construction
will take place on the GPU. This is the recommended setting when using a single-GPU and
the whole graph fits in GPU memory.
* If the input graph :attr:`g` is on CPU while the output device :attr:`device` is GPU, then
depending on the value of :attr:`num_workers`:
- If :attr:`num_workers` is set to 0, the sampling will happen on the CPU, and then the
subgraphs will be constructed directly on the GPU. This is the recommended setting in
multi-GPU configurations.
- Otherwise, if :attr:`num_workers` is greater than 0, both the sampling and subgraph
construction will take place on the CPU. This is the recommended setting when using a
single-GPU and the whole graph does not fit in GPU memory.
Overall, the simplest invocation is:
r"""An abstract class representing a :class:`Dataset`.
All datasets that represent a map from keys to data samples should subclass
it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
data sample for a given key. Subclasses could also optionally overwrite
:meth:`__len__`, which is expected to return the size of the dataset by many
:class:`~torch.utils.data.Sampler` implementations and the default options
of :class:`~torch.utils.data.DataLoader`.
.. note::
:class:`~torch.utils.data.DataLoader` by default constructs a index
sampler that yields integral indices. To make it work with a map-style
dataset with non-integral indices/keys, a custom sampler must be provided.
"""
import dgl
import torch

g = dgl.graph(([0, 0, 1, 1, 2, 2], [1, 2, 0, 1, 2, 0]))
g.edata['prob'] = torch.FloatTensor([0., 1., 0., 1., 0., 1.])
sg = dgl.sampling.sample_neighbors(g, [0, 1], 1)
print(sg.edges(order='eid'))
print(sg.edata[dgl.EID])
Output:
(tensor([1, 0]), tensor([0, 1]))
tensor([2, 0])
The result differs from run to run; since this code does not use the edge probabilities, the sampling here is random. edges(order='eid') lists the edges in the subgraph's own edge order, while dgl.EID stores each sampled edge's ID in the original graph.
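Continuing from the snippet above, a quick check that dgl.EID really does index into the original graph (sample_neighbors copies edge data by default, so the copied 'prob' values must match):

orig_prob = g.edata['prob'][sg.edata[dgl.EID]]   # look up the original edges by their global IDs
assert torch.equal(sg.edata['prob'], orig_prob)  # the copied features match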
In the multi-machine version, the sample_neighbors function in distributed/graph_services.py is used: issue_remote_req accesses subgraph information on other machines, and local_access handles the subgraph information stored locally (all graphs here are DistGraph objects).
The block (i.e., the graph) and src_feature are passed to SAGE.
Then graph.num_dst_nodes() is used to take the first k rows of the node features; those nodes are the dst_nodes and their dst_features are sliced out.
Next the src_features are aggregated, using the simplest "mean" method. DGL has an optimization here: it checks whether the linear layer reduces the feature dimension; if it does, the linear transform is applied before aggregation so the (cheaper) aggregation runs on smaller vectors, otherwise aggregation runs first and the linear layer afterwards.
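A stripped-down sketch of that forward pass (my own minimal layer, not DGL's actual SAGEConv, just to make the dst slicing and the dimension check concrete):

import torch.nn as nn
import dgl.function as fn

class MeanSAGELayer(nn.Module):
    def __init__(self, in_feats, out_feats):
        super().__init__()
        self.fc_self = nn.Linear(in_feats, out_feats)
        self.fc_neigh = nn.Linear(in_feats, out_feats)
        self.in_feats, self.out_feats = in_feats, out_feats

    def forward(self, block, src_feat):
        # the first num_dst_nodes() rows of src_feat are the destination nodes
        dst_feat = src_feat[:block.num_dst_nodes()]
        # if the linear layer shrinks the dimension, apply it before aggregation
        # so the vectors being averaged are smaller
        lin_before_mp = self.in_feats > self.out_feats
        with block.local_scope():
            block.srcdata['h'] = self.fc_neigh(src_feat) if lin_before_mp else src_feat
            block.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'neigh'))
            h_neigh = block.dstdata['neigh']
        if not lin_before_mp:
            h_neigh = self.fc_neigh(h_neigh)
        return self.fc_self(dst_feat) + h_neigh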
Refer to the DGL user guide --------------------> built-in functions and API calls,
which covers many message-passing APIs, such as message functions (applied on edges), reduce functions (applied on nodes), and update_all, as well as functions you often see in DGL like u_add_v and apply_edges. Note that in DGL, u usually denotes the src node and v the dst node.
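A small end-to-end example of those built-ins (toy graph; the output field names are mine):

import dgl
import dgl.function as fn
import torch

g = dgl.graph(([0, 0, 1], [1, 2, 2]))
g.ndata['h'] = torch.ones(3, 2)

# message function on edges: x_(u,v) = h_u + h_v
g.apply_edges(fn.u_add_v('h', 'h', 'x'))

# message + reduce in one call: every dst node sums its src features
g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h_sum'))

print(g.edata['x'])       # one row per edge
print(g.ndata['h_sum'])   # node 0 has no in-edges, so its row is zeros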