1. Introduction
Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. Horovod's goal is to make distributed deep learning fast and easy to use.
2. Installation
https://github.com/uber/horovod/blob/master/docs/gpus.md
https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
Download nccl_2.1.15-1+cuda9.1:
https://developer.nvidia.com/compute/machine-learning/nccl/secure/v2.1/prod/nccl_2.1.15-1cuda9.1_x86_64
[root@node local]# tar xvf nccl_2.1.15-1+cuda9.1_x86_64.txz
[root@node local]# mv nccl_2.1.15-1+cuda9.1_x86_64 nccl_2.1.15
Add the following to /etc/profile:
export LD_LIBRARY_PATH=/usr/local/nccl_2.1.15/lib:$LD_LIBRARY_PATH
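After reloading /etc/profile, one quick way to check that the dynamic loader can find NCCL is to load the shared library from Python. A minimal sketch, assuming the standard NCCL 2.x soname libnccl.so.2 lives under the directory added above:

import ctypes

# Resolved through LD_LIBRARY_PATH; raises OSError if NCCL is not on the loader path.
nccl = ctypes.CDLL("libnccl.so.2")
print("NCCL library loaded:", nccl)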
https://www.open-mpi.org/faq/?category=building#easy-build
openmpi-4.0.0 has problems; install openmpi-3.1.2 instead.
[root@node tools]# wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz
[root@node tools]# tar zxvf openmpi-3.1.2.tar.gz
[root@node tools]# cd openmpi-3.1.2/
[root@node openmpi-3.1.2]# ./configure --prefix=/usr/local
[root@node openmpi-3.1.2]# make -j8
[root@node openmpi-3.1.2]# make install
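Once make install finishes, it is worth confirming that the mpirun found on PATH is the freshly built 3.1.2. A minimal sketch in Python (running `mpirun --version` directly in the shell works just as well); it only assumes mpirun is on PATH:

import subprocess

# Print the Open MPI version string; it should report 3.1.2.
print(subprocess.check_output(["mpirun", "--version"]))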
Build and install Horovod with NCCL allreduce support (system-wide):
HOROVOD_NCCL_HOME=/usr/local/nccl_2.1.15 HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
To install into your own user directory instead:
HOROVOD_NCCL_HOME=/usr/local/nccl_2.4.2 HOROVOD_GPU_ALLREDUCE=NCCL pip install --prefix=~/tf_bin --ignore-installed --no-cache-dir horovod
HOROVOD_CUDA_HOME=/usr/local/cuda-10.0 HOROVOD_NCCL_HOME=/usr/local/nccl_2.4.2 HOROVOD_GPU_ALLREDUCE=NCCL pip install --prefix=~/tf_bin --no-cache-dir horovod
Then set the environment variable:
export PYTHONPATH=~/tf_bin/lib64/python2.7/site-packages
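With PYTHONPATH set, a quick check that the build actually picked up MPI and NCCL can be done from Python. A minimal sketch; nccl_built() and mpi_built() are available in recent Horovod releases (if your version predates them, a clean import followed by hvd.init() is already a good sign):

import horovod.tensorflow as hvd

hvd.init()
# Both helpers return booleans describing how this Horovod wheel was compiled.
print("MPI support built:", hvd.mpi_built())
print("NCCL allreduce built:", hvd.nccl_built())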
For models with a high computation-to-communication ratio, allreduce can be done on the CPU instead:
opt = hvd.DistributedOptimizer(opt, device_dense='/cpu:0')
3. Usage
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process):
# each process is assigned one GPU; the first process uses the first GPU, and so on.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
# Scale the learning rate by the number of workers. The effective batch size of
# synchronous distributed training is determined by the number of workers, and the
# larger learning rate compensates for the larger batch size.
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer: wrap each regular TensorFlow optimizer with the
# Horovod optimizer, which averages gradients using ring-allreduce.
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization, so that all workers start from a consistent state. If the program
# cannot use MonitoredTrainingSession, call hvd.broadcast_global_variables(0) instead.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
4. Running Horovod
https://github.com/uber/horovod/blob/master/docs/running.md
To run on a single machine with 4 GPUs:
$ mpirun -np 4 \
    -H localhost:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py
Notes:
-bind-to none tells Open MPI not to bind training processes to a single CPU core (binding would hurt performance).
-map-by slot allows a mixture of different NUMA configurations, since the default behavior is to bind to the socket.
-mca pml ob1 -mca btl ^openib forces MPI to use TCP, avoiding various problems Open MPI has with RDMA; this does not hurt performance, because the heavy communication is handled by NCCL instead.
The -x option sets an environment variable (-x NCCL_DEBUG=INFO) or copies an existing one (-x LD_LIBRARY_PATH) to all workers.
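Before launching a real training job, a short script run under the same mpirun command (substituting it for train.py) is a convenient sanity check that every process gets its own rank and GPU and that allreduce works end to end. A minimal sketch for the TensorFlow 1.x API used above; the file name (e.g. check_ranks.py) is just an example:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin this process to its local GPU, as in the training example above.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Sum a one-element tensor holding this process's rank across all workers.
# With -np 4 the expected result is 0 + 1 + 2 + 3 = 6.
summed = hvd.allreduce(tf.constant([float(hvd.rank())]), average=False)

with tf.Session(config=config) as sess:
    result = sess.run(summed)

print("rank %d of %d (local rank %d): allreduce sum = %s"
      % (hvd.rank(), hvd.size(), hvd.local_rank(), result))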