So far our work has only used single-machine multi-GPU fine-tuning; to improve training efficiency, this post experiments with multi-machine multi-GPU distributed training.
The experiment uses two machines (manager and worker) running Ubuntu 22.04, each with 4 GPUs.
To keep the installation and configuration uniform, Docker containers are used; installing Docker itself is not covered here.
Initialize the cluster by running on the manager machine:
docker swarm init
# Output:
Swarm initialized: current node (k4ehuhg4a2umpjoo7yovy1caf) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-686yitjn5p5twd3b3pzezqofd8dlk1wm6juqo3xb5bj4xzztvh-15obj4grc8p8mqul8qvmupkdi 192.168.11.11:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
Join the cluster by running on the worker machine:
docker swarm join --token SWMTKN-1-686yitjn5p5twd3b3pzezqofd8dlk1wm6juqo3xb5bj4xzztvh-15obj4grc8p8mqul8qvmupkdi 192.168.11.11:2377
Create an overlay network on the manager node:
docker network create --driver=overlay --attachable test-net
Run docker network ls to check the current networks; the last line shows that test-net has been created:
NETWORK ID     NAME                      DRIVER    SCOPE
ec8c853e521d   bridge                    bridge    local
72574615b63f   docker_gwbridge           bridge    local
9fbe2f6c3b22   freeaskinternet_default   bridge    local
b8273bdcc836   host                      host      local
ii71ul2agult   ingress                   overlay   swarm
eadcc6c24a81   none                      null      local
fxnzpd6r1hr0   sharednet                 overlay   swarm
wdoj2fcw29np   test-net                  overlay   swarm
Install docker-compose (release 1.29.2 is used here) and verify it works:
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/bin/docker-compose
sudo chmod +x /usr/bin/docker-compose
docker-compose --version
Create a working directory for the build files:
mkdir work
cd work
# Dockerfile
FROM nvcr.io/nvidia/cuda:12.2.0-devel-ubuntu22.04
# Update package lists and install build dependencies and network tools
RUN apt-get update && apt-get install -y git build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libsqlite3-dev libreadline-dev libffi-dev liblzma-dev libbz2-dev curl wget net-tools iputils-ping pdsh
# Build and install Python 3.10 from source
WORKDIR /home/user
RUN wget https://www.python.org/ftp/python/3.10.6/Python-3.10.6.tgz && \
    tar -zxvf Python-3.10.6.tgz && cd Python-3.10.6 && \
    ./configure --enable-optimizations && make -j 4 && make install
# docker-compose.yml
version: "3"
services:
  llmtrain:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: llmtrain
    tty: true
    restart: always
    ulimits:
      memlock: -1
      stack: 67108864
    shm_size: 40G
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    volumes:
      - ./code:/home/user/code:cached
    networks:
      - test-net
networks:
  test-net:
    external: true
Build and start the container, then open a shell inside it and inspect the network interfaces:
sudo docker-compose up -d --build
sudo docker exec -it <container ID> /bin/bash
ifconfig -a
eth0: flags=4163 mtu 1450
inet 10.0.1.14 netmask 255.255.255.0 broadcast 10.0.1.255
ether 02:42:0a:00:01:0e txqueuelen 0 (Ethernet)
RX packets 2170444797 bytes 11730029590467 (11.7 TB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1371803017 bytes 11419623920546 (11.4 TB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth1: flags=4163 mtu 1500
inet 172.18.0.3 netmask 255.255.0.0 broadcast 172.18.255.255
ether 02:42:ac:12:00:03 txqueuelen 0 (Ethernet)
RX packets 74646 bytes 395241942 (395.2 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 44728 bytes 3336632 (3.3 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73 mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
loop txqueuelen 1000 (Local Loopback)
RX packets 161709 bytes 15509786 (15.5 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 161709 bytes 15509786 (15.5 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Enter each of the two containers, note their IP addresses, and ping each other to confirm the overlay network works.
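Besides ping, TCP reachability between the containers can be checked from Python. The helper below is a generic sketch; the peer address 10.0.1.16 and port 22 mentioned in the comment are assumptions based on the addresses shown above.

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Self-test against a local listener; inside the cluster you would probe the
# other container's overlay IP instead, e.g. can_connect("10.0.1.16", 22).
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
ok = can_connect("127.0.0.1", srv.getsockname()[1])
srv.close()
print(ok)  # → True
```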
pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install deepspeed
Note: steps 2-10 must also be executed on the other machine (worker).
First, install and start the openssh-server service in the containers on both the manager and worker nodes:
# Install the SSH service
apt-get install openssh-server -y
# Start the SSH service
/etc/init.d/ssh start
Note: all of the following operations are performed inside the manager and worker node containers.
Run ssh-keygen -t rsa in each container and press Enter through every prompt:
ssh-keygen -t rsa
Copy each node's ~/.ssh/id_rsa.pub into the other node's ~/.ssh/authorized_keys; when copying the file contents, watch out for stray carriage-return and newline characters.
Next, add the following mappings to /etc/hosts on both the manager and worker nodes:
10.0.1.14 worker
10.0.1.16 manager
Finally, test whether the containers can log in to each other without a password; if everything is configured correctly, no password prompt should appear:
ssh manager
ssh worker
Add the following to ~/.bashrc:
# Note the NCCL settings: specify the network interface NCCL should use for communication based on your machine; here it is eth0, which can be confirmed with ifconfig -a
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO
Then do not forget to run source ~/.bashrc to apply the changes; perform the same steps on the worker node.
This experiment uses the bloom-7B model.
The Hugging Face Trainer has first-class DeepSpeed support: a single configuration argument is enough, passing the config file path via TrainingArguments(deepspeed=...) or the --deepspeed command-line flag.
The hostfile tells the DeepSpeed launcher which machines to use and how many GPUs each provides:
# slots is the number of GPUs available on the corresponding machine
manager slots=4
worker slots=4
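The DeepSpeed launcher starts one process per slot, so the hostfile above implies a world size of 8. A hypothetical parsing helper (not part of DeepSpeed's API) makes the arithmetic explicit:

```python
# Hypothetical helper: compute the world size implied by a DeepSpeed hostfile.
# Each "host slots=N" line contributes N ranks (one process per GPU).
def world_size_from_hostfile(lines):
    total = 0
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        _host, _, slots = line.partition("slots=")
        total += int(slots)
    return total

print(world_size_from_hostfile(["manager slots=4", "worker slots=4"]))  # → 8
```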
The deepspeed config file ds_config_1.json contains:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bfloat16": {
    "enabled": false
  },
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8
  },
  "steps_per_print": 10,
  "wall_clock_breakdown": false,
  "checkpoint": {
    "use_node_local_storage": true
  }
}
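Once the Trainer resolves the "auto" fields, DeepSpeed enforces the invariant train_batch_size = train_micro_batch_size_per_gpu x gradient_accumulation_steps x world_size. A quick sanity check for this two-node, 8-GPU setup (the micro-batch and accumulation values are illustrative assumptions, not values from this run):

```python
# DeepSpeed batch-size invariant (illustrative numbers, not from this run):
# train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size
world_size = 2 * 4            # 2 machines x 4 GPUs each
micro_batch_per_gpu = 4       # assumed train_micro_batch_size_per_gpu
gradient_accumulation = 2     # assumed gradient_accumulation_steps
train_batch_size = micro_batch_per_gpu * gradient_accumulation * world_size
print(train_batch_size)       # → 64
```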
This experiment uses ZeRO stage 1; in practice, choose the stage based on your model size and GPU memory.
For details, see the DeepSpeed documentation: Zero Redundancy Optimizer - DeepSpeed
Launch training from the manager node:
deepspeed --hostfile hostfile finetune.py --deepspeed ./ds_config_1.json
root@b50557cdc89c:/home/user/code# nvidia-smi
Wed May 29 02:08:43 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... Off | 00000000:34:00.0 Off | 0 |
| N/A 56C P0 83W / 300W | 79577MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... Off | 00000000:35:00.0 Off | 0 |
| N/A 54C P0 78W / 300W | 80555MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A800 80G... Off | 00000000:9D:00.0 Off | 0 |
| N/A 62C P0 99W / 300W | 80379MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A800 80G... Off | 00000000:9E:00.0 Off | 0 |
| N/A 59C P0 91W / 300W | 80763MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
If NCCL uses socket (TCP) communication, DeepSpeed multi-machine distributed training is very slow; it can even be slower than single-machine multi-GPU training. InfiniBand (IB) communication is recommended, but it requires the corresponding hardware.
An nccl-test run showed that multi-machine NCCL over sockets reached about one tenth of the single-machine speed: roughly 0.3 GB/s across machines versus 4 GB/s on a single machine.
# Install NCCL; it is now available from the package repositories
apt install libnccl2 libnccl-dev
# Check that the installation succeeded
ldconfig -p | grep libnccl
# Install mpich
apt-get install mpich
# Install nccl-tests
# Download from https://github.com/nvidia/nccl-tests or clone:
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make mpi=1
# Single-machine test
# MPI mode: one process per GPU, so use -g 1 with -np 4
mpirun -np 4 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# Single-process mode: one process driving all 4 GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# Multi-machine MPI test: 8 processes across the two nodes, one GPU each
mpirun -np 8 -hosts manager,worker -map-by slot -env NCCL_DEBUG INFO -env NCCL_SOCKET_IFNAME eth0 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
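When reading the nccl-tests output, note that it reports both an algorithm bandwidth (algbw) and a bus bandwidth (busbw); for all_reduce the relationship is busbw = algbw * 2*(n-1)/n, where n is the number of ranks. A small helper to convert between the two (the bandwidth figures below are illustrative, echoing the rough numbers quoted earlier, not exact output from this run):

```python
# Convert nccl-tests all_reduce algorithm bandwidth (algbw) to bus bandwidth
# (busbw). For all_reduce the correction factor is 2*(n-1)/n, where n is the
# number of ranks (see the nccl-tests performance notes).
def allreduce_busbw(algbw_gbps, n_ranks):
    return algbw_gbps * 2 * (n_ranks - 1) / n_ranks

# Illustrative numbers only:
print(allreduce_busbw(4.0, 4))  # single machine, 4 ranks → 6.0
print(allreduce_busbw(0.3, 8))  # two machines over sockets, 8 ranks → 0.525
```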