DGL distributed training user guide
Server processes run on each machine that stores a graph partition (this includes the graph structure and node/edge features). These servers work together to serve the graph data to trainers. Note that one machine may run multiple server processes simultaneously to parallelize computation as well as network communication. In other words, the servers are responsible not only for the data but also for the communication.
Sampler processes interact with the servers and sample nodes and edges to generate mini-batches for training.
Trainers contain multiple classes to interact with servers. They have DistGraph to access the partitioned graph data, and DistEmbedding and DistTensor to access the node/edge features and embeddings. They have DistDataLoader to interact with samplers to get mini-batches.
dgl.distributed.initialize(): this API builds connections with DGL servers and creates sampler processes.
def initialize(ip_config, num_servers=1, num_workers=0,
max_queue_size=MAX_QUEUE_SIZE, net_type='socket',
"""Initialize DGL's distributed module
This function initializes DGL's distributed module. It acts differently in server
or client modes. In the server mode, it runs the server code and never returns.
In the client mode, it builds connections with servers for communication and
creates worker processes for distributed sampling. `num_workers` specifies
the number of sampling worker processes per trainer process.
Users also have to provide the number of server processes on each machine in order
to connect to all the server processes in the cluster of machines correctly.
A socket is one endpoint of a two-way communication link between two programs running on the network. A socket is bound to a port number so that the TCP layer can identify the application that data is destined to be sent to.
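Since the net_type of dgl.distributed.initialize() currently only supports socket, it is well worth understanding the socket communication model; see "What is a socket?".
The client then needs to know the server's IP and port, and sends a request to rendezvous with the server on the server's socket (that is, its IP and port). The client also needs to identify itself to the server, so it binds to a local port number that it will use during this connection. This is usually assigned by the system.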
If everything goes well, the server accepts the connection. Upon acceptance, the server gets a new socket bound to the same local port and also has its remote endpoint set to the address and port of the client. It needs a new socket so that it can continue to listen to the original socket for connection requests while tending to the needs of the connected client. (Question: so after accepting, does the server get two new sockets? No: it gets one new socket per accepted connection, and the original listening socket simply stays open.)
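The accept step described above can be tried locally with Python's standard socket module. This is only a minimal sketch of the generic TCP model, not DGL's own networking code; the helper name accept_one and the loopback address are choices made for this illustration.

```python
import socket
import threading

def accept_one(listener, result):
    # accept() blocks until a client connects and returns a NEW socket
    # dedicated to that client; the listener itself stays open.
    conn, client_addr = listener.accept()
    result["conn_local"] = conn.getsockname()    # same port as the listener
    result["conn_remote"] = conn.getpeername()   # the client's address and port
    result["listener_open"] = listener.fileno() != -1
    conn.sendall(b"hello")
    conn.close()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
listener.listen()
server_addr = listener.getsockname()

result = {}
t = threading.Thread(target=accept_one, args=(listener, result))
t.start()

# The client only needs the server's IP and port; its own local port
# is assigned by the system when it connects.
client = socket.create_connection(server_addr)
client_local = client.getsockname()
data = client.recv(5)

t.join()
client.close()
listener.close()
```

After the run, result["conn_local"] carries the listener's port while result["conn_remote"] equals the client's system-assigned address, which is the pairing the paragraph above describes.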
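Each thread is classified by its SERVER_ID as a backup_server or not. A machine may run several servers, but only one of them is the main server, and only the main server loads the partition. The other backup servers load only the partition book of the partitioned graph. DGL has two kinds of partition book; which one a backup server loads is decided in load_partition_book and seems to depend on the partition's .json file. For details on the graph partition book, see the DGL docs; the two main kinds are BasicGraphPartitionBook and RangePartitionBook.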
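On the client side, initialize defines a global SAMPLE_POOL, an instance of DGL's own CustomPool class. The number of processes in the pool is set by num_workers (which comes from the DGL_NUM_SAMPLER parameter), and each process runs init_process. Each process also defines a queue sized num_worker*4. When the pool's submit_task is called, the corresponding command is put on the queue. submit_task is what DistDataLoader interacts with when it fetches the next batch: DistDataLoader first takes the node IDs of the batch and then obtains the node features through collate_fn (that is, the sampler function).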
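The submit-then-collect pattern described above can be sketched with a bounded queue. This is not DGL's CustomPool: TinyPool, get_result, close and worker_fn are hypothetical names, and the sketch uses threads instead of processes to stay short. Only the idea of a queue sized num_workers * 4 and a submit_task method comes from the notes.

```python
import queue
import threading

class TinyPool:
    """Hypothetical stand-in for a sampler pool (not DGL's CustomPool)."""

    def __init__(self, num_workers, worker_fn):
        # Bounded queues: submit_task blocks once num_workers * 4 tasks wait.
        self.queue_size = num_workers * 4
        self.task_queue = queue.Queue(self.queue_size)
        self.result_queue = queue.Queue(self.queue_size)
        self.workers = [
            threading.Thread(target=self._loop, args=(worker_fn,))
            for _ in range(num_workers)
        ]
        for w in self.workers:
            w.start()

    def _loop(self, worker_fn):
        # Each worker keeps pulling tasks until it sees the stop marker.
        while True:
            task = self.task_queue.get()
            if task is None:
                break
            batch_id, seeds = task
            self.result_queue.put((batch_id, worker_fn(seeds)))

    def submit_task(self, batch_id, seeds):
        self.task_queue.put((batch_id, seeds))

    def get_result(self):
        return self.result_queue.get()

    def close(self):
        for _ in self.workers:
            self.task_queue.put(None)
        for w in self.workers:
            w.join()

# The data loader side: hand out seed node IDs, then collect sampled batches.
pool = TinyPool(num_workers=2, worker_fn=lambda seeds: [s * 10 for s in seeds])
for batch_id, seeds in enumerate([[0, 1], [2, 3], [4, 5]]):
    pool.submit_task(batch_id, seeds)
results = dict(pool.get_result() for _ in range(3))
pool.close()
```

Results may arrive out of order, which is why each task carries a batch_id in this sketch.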
Each machine is responsible for one and only one partition. It loads the partition data (the graph structure and the node data and edge data in the partition) and makes it accessible to all trainers in the cluster.
Distributed version: DistGraph connects with the servers in the cluster of machines and accesses them through the network. This shows that graph information is still transferred between servers by communication.
Currently, DGL does not provide protection for concurrent writes from multiple trainers when a machine runs multiple servers. This may result in data corruption. One way to avoid concurrent writes to the same row of data is to run one server process on a machine.
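How do I run just one server process on a machine? (Presumably by setting the number of servers per machine to 1, as DGL_NUM_SERVER=1 does in the launch commands further down.)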
Internally, distributed embeddings are built on top of distributed tensors and thus behave very similarly to distributed tensors. For example, when embeddings are created, they are sharded and stored across all machines in the cluster. Each embedding can be uniquely identified by a name.
As a lower-level sampling API, DGL provides sample_neighbors() for distributed neighborhood sampling on DistGraph.
Below is an example adjacency matrix of a heterogeneous graph showing the homogeneous ID assignment. Here, the graph has two types of nodes (T0 and T1), and four types of edges (R0, R1, R2, R3). There are a total of 400 nodes in the graph and each type has 200 nodes. Nodes of T0 have IDs in [0, 200), while nodes of T1 have IDs in [200, 400). In this example, if we use a tuple to identify the nodes, nodes of T0 are identified as (T0, type-wise ID), where type-wise ID falls in [0, 200); nodes of T1 are identified as (T1, type-wise ID), where type-wise ID also falls in [0, 200).
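The ID assignment in this example can be written as a pair of conversion functions. This is a minimal sketch for the 400-node example only; to_typed and to_homo are hypothetical helper names, not DGL functions, and the offsets are taken directly from the ranges given above.

```python
from bisect import bisect_right

# Node types and the start of each type's range in the homogeneous ID space:
# T0 -> [0, 200), T1 -> [200, 400).
NTYPES = ["T0", "T1"]
OFFSETS = [0, 200, 400]

def to_typed(homo_id):
    """Map a homogeneous node ID to (node type, type-wise ID)."""
    if not 0 <= homo_id < OFFSETS[-1]:
        raise ValueError(f"homogeneous ID {homo_id} out of range")
    idx = bisect_right(OFFSETS, homo_id) - 1
    return NTYPES[idx], homo_id - OFFSETS[idx]

def to_homo(ntype, type_id):
    """Map (node type, type-wise ID) back to a homogeneous node ID."""
    idx = NTYPES.index(ntype)
    if not 0 <= type_id < OFFSETS[idx + 1] - OFFSETS[idx]:
        raise ValueError(f"type-wise ID {type_id} out of range for {ntype}")
    return OFFSETS[idx] + type_id

first_t1 = to_typed(200)       # first node of T1
last_t0 = to_typed(199)        # last node of T0
round_trip = to_homo(*to_typed(357))
```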
Copy the partitioned graph and the training script to the specified machines (via ip_config).
, node IDs and edge IDs of a partition do not fall into contiguous (for details, please refer to my blog)
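In my setup there are two machines. launch.py remotely creates two servers and two clients, starting them over ssh. The two servers run on the two machines, one each. On the client side, the first IP in ip_config.txt is the master: PyTorch DDP is invoked, so the first IP is MASTER_ADDR. For how threading.Thread(), start() and join() are used, see "Python Thread.join() explained".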
‘cd /home/user/gnn-tutorial/graphsage/experimental; (export PATH=$PATH:/home/user/anaconda3/bin; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=0; /home/user/anaconda3/envs/torch/bin/python train_dist_noprof.py --graph_name reddit --ip_config ip_config.txt --num_gpus 1 --local_rank 0 --num_epochs 3 --batch_size 1000))’
‘cd /home/user/gnn-tutorial/graphsage/experimental; (export PATH=$PATH:/home/user/anaconda3/bin; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=1; /home/user/anaconda3/envs/torch/bin/python train_dist_noprof.py --graph_name reddit --ip_config ip_config.txt --num_gpus 1 --local_rank 0 --num_epochs 3 --batch_size 1000))’
‘cd /home/user/gnn-tutorial/graphsage/experimental; (export PATH=$PATH:/home/user/anaconda3/bin; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=10 ; /home/user/anaconda3/envs/torch/bin/python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr= --master_port=1234 train_dist_noprof.py --graph_name reddit --ip_config ip_config.txt --num_gpus 1 --local_rank 0 --num_epochs 3 --batch_size 1000))’
‘cd /home/user/gnn-tutorial/graphsage/experimental; (export PATH=$PATH:/home/user/anaconda3/bin; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=10 ; /home/user/anaconda3/envs/torch/bin/python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr= --master_port=1234 train_dist_noprof.py --graph_name reddit --ip_config ip_config.txt --num_gpus 1 --local_rank 0 --num_epochs 3 --batch_size 1000))’
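The threading pattern mentioned above (one thread per remote command, then join() on all of them) can be sketched as follows. This is an illustration under assumptions, not launch.py itself: run_remote is a hypothetical helper, and instead of ssh it runs a local Python one-liner so that the sketch works on a single machine.

```python
import subprocess
import sys
import threading

def run_remote(cmd, outputs, idx):
    # Stand-in for `ssh <ip> '<cmd>'`: run a local Python one-liner instead.
    proc = subprocess.run(
        [sys.executable, "-c", cmd], capture_output=True, text=True
    )
    outputs[idx] = proc.stdout.strip()

# Two servers and two clients, as in the two-machine setup above.
commands = [
    "print('server 0 up')",
    "print('server 1 up')",
    "print('client 0 up')",
    "print('client 1 up')",
]
outputs = [None] * len(commands)

threads = []
for i, cmd in enumerate(commands):
    t = threading.Thread(target=run_remote, args=(cmd, outputs, i))
    t.start()            # start() returns immediately; the command runs in parallel
    threads.append(t)

for t in threads:
    t.join()             # join() blocks until that thread's command has finished
```

Starting every thread before joining any of them is what lets the servers and clients come up concurrently; joining inside the first loop would run them one after another.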
Try the DDP test again.
As before, refer to Pytorch DDP.
What export does
Why exporting master-addr has no effect
There is also an issue about the PyTorch DDP launch file: "alternative api for torch.distributed.launch". It is worth keeping an eye on the PyTorch and DGL discussions.
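The two questions above about export can be explored with a small experiment. An exported variable is inherited by child processes, but a launcher that builds its own environment for the child can overwrite it. That a launcher overwrites MASTER_ADDR with its --master_addr argument is an assumption here, not something these notes confirm; launch_child is a hypothetical stand-in and the address 10.0.0.1 is made up.

```python
import os
import subprocess
import sys

PRINT_ADDR = "import os; print(os.environ['MASTER_ADDR'])"

def launch_child(master_addr):
    # Hypothetical launcher: copy the parent's environment, then overwrite
    # MASTER_ADDR with the value given on the command line.
    env = os.environ.copy()
    env["MASTER_ADDR"] = master_addr
    out = subprocess.run(
        [sys.executable, "-c", PRINT_ADDR],
        env=env, capture_output=True, text=True,
    )
    return out.stdout.strip()

# Equivalent of `export MASTER_ADDR=10.0.0.1` in the parent shell.
os.environ["MASTER_ADDR"] = "10.0.0.1"

# A plain child process inherits the exported value...
inherited = subprocess.run(
    [sys.executable, "-c", PRINT_ADDR], capture_output=True, text=True
).stdout.strip()

# ...but a launcher that sets the variable itself wins over the export.
overridden = launch_child("127.0.0.1")
```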
Assume that you have the following graph
>>> g = dgl.graph(([0, 0, 1, 1, 2, 2], [1, 2, 0, 1, 2, 0]))
And the weights
>>> g.edata['prob'] = torch.FloatTensor([0., 1., 0., 1., 0., 1.])
To sample one inbound edge for node 0 and node 1:
>>> sg = dgl.sampling.sample_neighbors(g, [0, 1], 1)
>>> sg.edges(order='eid')
(tensor([1, 0]), tensor([0, 1]))
>>> sg.edata[dgl.EID]
tensor([2, 0])
To sample one inbound edge for node 0 and node 1 with probabilities taken from the edge feature ``prob``:
>>> sg = dgl.sampling.sample_neighbors(g, [0, 1], 1, prob='prob')
>>> sg.edges(order='eid')
(tensor([2, 1]), tensor([0, 1]))
With ``fanout`` greater than the number of actual neighbors and without replacement,
DGL will take all neighbors instead:
>>> sg = dgl.sampling.sample_neighbors(g, [0, 1], 3)
>>> sg.edges(order='eid')
(tensor([1, 2, 0, 1]), tensor([0, 0, 1, 1]))
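To make the three cases above concrete without installing DGL, here is a pure-Python sketch of inbound-edge sampling on the same six-edge graph. It is not DGL's implementation: sample_in_edges is a hypothetical function, it returns original edge IDs rather than a subgraph, and it assumes that zero-probability edges are never picked.

```python
import random

# Same graph as above: edge i goes from src[i] to dst[i].
src = [0, 0, 1, 1, 2, 2]
dst = [1, 2, 0, 1, 2, 0]
prob = [0., 1., 0., 1., 0., 1.]

def sample_in_edges(seeds, fanout, prob=None, rng=random):
    """Pick up to `fanout` inbound edges per seed node, without replacement."""
    picked = []
    for node in seeds:
        cand = [e for e in range(len(dst)) if dst[e] == node]
        if prob is not None:
            cand = [e for e in cand if prob[e] > 0]   # assumption: skip zero weights
        if len(cand) <= fanout:
            picked.extend(cand)     # not enough neighbors: take them all
            continue
        pool = list(cand)
        for _ in range(fanout):
            weights = [prob[e] for e in pool] if prob is not None else None
            e = rng.choices(pool, weights=weights, k=1)[0]
            pool.remove(e)          # without replacement
            picked.append(e)
    return picked

all_eids = sample_in_edges([0, 1], 3)                  # fanout larger than degree
weighted_eids = sample_in_edges([0, 1], 1, prob=prob)  # only weight-1 edges remain
uniform_eids = sample_in_edges([0, 1], 1, rng=random.Random(0))
```

With the weights above each seed node has exactly one inbound edge of non-zero probability, so the weighted case is deterministic: it selects the edges 2 -> 0 and 1 -> 1, matching the sources and destinations printed in the weighted example.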
If you are using PyTorch's distributed training (e.g. when using
:mod:`torch.nn.parallel.DistributedDataParallel`), you can train the model by turning
on the `use_ddp` option:
>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> dataloader = dgl.dataloading.NodeDataLoader(
...     g, train_nid, sampler, use_ddp=True,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for epoch in range(start_epoch, n_epochs):
...     dataloader.set_epoch(epoch)
...     for input_nodes, output_nodes, blocks in dataloader:
...         train_on(input_nodes, output_nodes, blocks)
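Why set_epoch is called at the start of every epoch can be illustrated with a small sketch of per-rank sharding. The idea is that every rank shuffles the full ID list with the same epoch-dependent seed and then keeps its own slice. Whether use_ddp does exactly this internally is an assumption; shard_for_rank is a hypothetical function, and the sketch skips details such as padding to equal shard sizes.

```python
import random

def shard_for_rank(ids, rank, world_size, epoch, seed=0, shuffle=True):
    """Return the training IDs that one rank would process in one epoch."""
    order = list(ids)
    if shuffle:
        # Same seed on every rank, so all ranks agree on the permutation;
        # adding the epoch changes the permutation from epoch to epoch.
        random.Random(seed + epoch).shuffle(order)
    return order[rank::world_size]

train_nid = list(range(10))
world_size = 2

shards_epoch0 = [shard_for_rank(train_nid, r, world_size, epoch=0)
                 for r in range(world_size)]
shards_again = [shard_for_rank(train_nid, r, world_size, epoch=0)
                for r in range(world_size)]
```

In this sketch, leaving the epoch fixed (the effect of never calling set_epoch) would give every epoch the same permutation, so each rank would see the same nodes in the same order every time.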
**Tips for selecting the proper device**
* If the input graph :attr:`g` is on GPU, the output device :attr:`device` must be the same GPU
and :attr:`num_workers` must be zero. In this case, the sampling and subgraph construction
will take place on the GPU. This is the recommended setting when using a single-GPU and
the whole graph fits in GPU memory.
* If the input graph :attr:`g` is on CPU while the output device :attr:`device` is GPU, then
depending on the value of :attr:`num_workers`:
- If :attr:`num_workers` is set to 0, the sampling will happen on the CPU, and then the
subgraphs will be constructed directly on the GPU. This is the recommended setting in
multi-GPU configurations.
- Otherwise, if :attr:`num_workers` is greater than 0, both the sampling and subgraph
construction will take place on the CPU. This is the recommended setting when using a
single-GPU and the whole graph does not fit in GPU memory.
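The tips above amount to a small decision table, sketched here as a function. sampling_placement is a hypothetical helper written for this note, not part of DGL; devices are plain strings such as "cpu" or "cuda:0". The last branch (CPU graph with CPU output) is not covered by the tips and is an assumption.

```python
def sampling_placement(graph_device, output_device, num_workers):
    """Return (where sampling runs, where subgraphs are built)."""
    graph_on_gpu = graph_device.startswith("cuda")
    output_on_gpu = output_device.startswith("cuda")

    if graph_on_gpu:
        # Graph on GPU: output must be the same GPU and num_workers must be 0.
        if output_device != graph_device or num_workers != 0:
            raise ValueError(
                "a GPU graph needs the same output device and num_workers=0"
            )
        return ("gpu", "gpu")

    if output_on_gpu:
        if num_workers == 0:
            # Sample on CPU, build subgraphs directly on the GPU.
            return ("cpu", "gpu")
        # Worker processes do both sampling and construction on the CPU.
        return ("cpu", "cpu")

    # Assumption: everything stays on the CPU when no GPU is involved.
    return ("cpu", "cpu")

single_gpu_fits = sampling_placement("cuda:0", "cuda:0", 0)
multi_gpu = sampling_placement("cpu", "cuda:1", 0)
single_gpu_large = sampling_placement("cpu", "cuda:0", 4)
```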
r"""An abstract class representing a :class:`Dataset`.
All datasets that represent a map from keys to data samples should subclass
it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
data sample for a given key. Subclasses could also optionally overwrite
:meth:`__len__`, which is expected to return the size of the dataset by many
:class:`~torch.utils.data.Sampler` implementations and the default options
of :class:`~torch.utils.data.DataLoader`.
.. note::
:class:`~torch.utils.data.DataLoader` by default constructs an index
sampler that yields integral indices. To make it work with a map-style
dataset with non-integral indices/keys, a custom sampler must be provided.
See the DGL user guide: "Build in function and API calls".