Horovod Installation and Usage

1. Introduction

Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. Its goal is to make distributed deep learning fast and easy to use.

2. Installation

https://github.com/uber/horovod/blob/master/docs/gpus.md

  1. Install NCCL

https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html

Download nccl_2.1.15-1+cuda9.1:

https://developer.nvidia.com/compute/machine-learning/nccl/secure/v2.1/prod/nccl_2.1.15-1cuda9.1_x86_64

[root@node local]# tar xvf nccl_2.1.15-1+cuda9.1_x86_64.txz

[root@node local]# mv nccl_2.1.15-1+cuda9.1_x86_64 nccl_2.1.15

Add the following to /etc/profile:

export LD_LIBRARY_PATH=/usr/local/nccl_2.1.15/lib:$LD_LIBRARY_PATH
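
As a quick check that the library can actually be found at runtime, the following minimal sketch loads it via ctypes (libnccl.so.2 is the soname shipped in the NCCL 2.x tarball):

# nccl_check.py - verify that libnccl.so.2 is reachable via LD_LIBRARY_PATH.
import ctypes

nccl = ctypes.CDLL('libnccl.so.2')  # raises OSError if the library cannot be found
print('libnccl.so.2 loaded successfully')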

  2. Install Open MPI

https://www.open-mpi.org/faq/?category=building#easy-build

openmpi-4.0.0 has issues, so install openmpi-3.1.2 instead:

[root@node tools]# wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz

[root@node tools]# tar zxvf openmpi-3.1.2.tar.gz

[root@node tools]# cd openmpi-3.1.2/

[root@node openmpi-3.1.2]# ./configure --prefix=/usr/local

[root@node openmpi-3.1.2]# make -j8

[root@node openmpi-3.1.2]# make install

  3. Install Horovod

HOROVOD_NCCL_HOME=/usr/local/nccl_2.1.15 HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod

To install under your own user directory instead:

HOROVOD_NCCL_HOME=/usr/local/nccl_2.4.2 HOROVOD_GPU_ALLREDUCE=NCCL pip install --prefix=~/tf_bin --ignore-installed --no-cache-dir horovod

Or, also pointing Horovod at the CUDA installation explicitly:

HOROVOD_CUDA_HOME=/usr/local/cuda-10.0 HOROVOD_NCCL_HOME=/usr/local/nccl_2.4.2 HOROVOD_GPU_ALLREDUCE=NCCL pip install --prefix=~/tf_bin --no-cache-dir horovod

Set the environment variable so Python can find the user-local install:

export PYTHONPATH=~/tf_bin/lib64/python2.7/site-packages
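
As a quick sanity check, the following minimal script (hypothetical file name check_horovod.py, not from the Horovod docs) can be run under mpirun to confirm that every process initializes Horovod and reports its rank:

# check_horovod.py - minimal check that the Horovod install and MPI work together.
# Run with, for example: mpirun -np 2 python check_horovod.py
import horovod.tensorflow as hvd

hvd.init()  # initialize Horovod (and MPI underneath)
print('rank %d of %d, local rank %d' % (hvd.rank(), hvd.size(), hvd.local_rank()))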

 

Some models with a high computation-to-communication ratio benefit from doing allreduce on the CPU:

opt = hvd.DistributedOptimizer(opt, device_dense='/cpu:0')

3. Usage

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process).
# Each process is assigned one GPU: the first process uses the first GPU, and so on.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...

# Scale the learning rate by the number of workers. The effective batch size in
# synchronous distributed training grows with the number of workers; increasing
# the learning rate compensates for the larger batch size.
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer.
# Wrap the regular TensorFlow optimizer with the Horovod optimizer, which
# averages gradients across workers using ring-allreduce.
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization, so that every worker starts from a consistent state. If the
# project cannot use MonitoredTrainingSession, call
# hvd.broadcast_global_variables(0) instead (see the sketch after this example).
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
  while not mon_sess.should_stop():
    # Perform synchronous training.
    mon_sess.run(train_op)
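
Where MonitoredTrainingSession cannot be used, a minimal sketch of the hvd.broadcast_global_variables(0) alternative looks like the following, assuming config and train_op are built exactly as in the example above (num_steps is a hypothetical step count):

# Sketch: consistent initialization with a plain tf.Session instead of
# MonitoredTrainingSession. config and train_op are assumed to be built
# exactly as in the example above.
num_steps = 1000  # hypothetical number of training steps

bcast = hvd.broadcast_global_variables(0)  # op that broadcasts rank 0's variables

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(bcast)  # every worker now starts from rank 0's weights
    for step in range(num_steps):
        sess.run(train_op)  # synchronous training step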

 

4. Running Horovod

https://github.com/uber/horovod/blob/master/docs/running.md

On a single machine with 4 GPUs:

$ mpirun -np 4 \

    -H localhost:4 \

    -bind-to none -map-by slot \

    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \

    -mca pml ob1 -mca btl ^openib \

    python train.py

Notes:

-bind-to none tells Open MPI not to bind the training processes to a single CPU core (binding hurts performance).

-map-by slot allows a mixture of different NUMA configurations; the default behavior is to bind to the socket.

-mca pml ob1 -mca btl ^openib forces MPI to use TCP for communication. This avoids various problems Open MPI can run into with RDMA, and it does not hurt performance because the heavy communication is handled by NCCL instead.

The -x option sets (-x NCCL_DEBUG=INFO) or copies (-x LD_LIBRARY_PATH) an environment variable to all workers.
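
For multiple machines, the linked running.md document uses the same flags; a run across 4 machines with 4 GPUs each would look like the following (server1 ... server4 are placeholder hostnames):

$ mpirun -np 16 \
    -H server1:4,server2:4,server3:4,server4:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py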

 
