在上一篇文章《为什么AI需要分布式并行?》中,我们介绍了AI算力为什么需要分布式并行、分布式并行的策略、以及MindSpore框架实现自动、半自动并行的原理。在本篇文章中,我们尝试基于MindSpore,使用自动并行策略完成Resnet-50模型的分布式训练,该实践步骤如下所示。
首先需要下载ResNet50训练的CIFAR-10数据集,该数据集由10类32*32的彩色图片组成,每类包含6000张图片。其中训练集共50000张图片,测试集共10000张图片。将数据集下载并解压到本地路径下,解压后的文件夹为cifar-10-batches-bin,数据集下载链接 。
在裸机环境进行分布式训练时,需要配置当前多卡环境的组网信息文件。以Ascend 910 AI处理器为例,1个8卡环境的json配置文件示例如下,本样例将该配置文件命名为rank_table_8pcs.json。
{
"board_id": "0x0000",
"chip_info": "910",
"deploy_mode": "lab",
"group_count": "1",
"group_list": [
{
"device_num": "8",
"server_num": "1",
"group_name": "",
"instance_count": "8",
"instance_list": [
{"devices": [{"device_id": "0","device_ip": "192.1.27.6"}],"rank_id": "0","server_id": "10.155.111.140"},
{"devices": [{"device_id": "1","device_ip": "192.2.27.6"}],"rank_id": "1","server_id": "10.155.111.140"},
{"devices": [{"device_id": "2","device_ip": "192.3.27.6"}],"rank_id": "2","server_id": "10.155.111.140"},
{"devices": [{"device_id": "3","device_ip": "192.4.27.6"}],"rank_id": "3","server_id": "10.155.111.140"},
{"devices": [{"device_id": "4","device_ip": "192.1.27.7"}],"rank_id": "4","server_id": "10.155.111.140"},
{"devices": [{"device_id": "5","device_ip": "192.2.27.7"}],"rank_id": "5","server_id": "10.155.111.140"},
{"devices": [{"device_id": "6","device_ip": "192.3.27.7"}],"rank_id": "6","server_id": "10.155.111.140"},
{"devices": [{"device_id": "7","device_ip": "192.4.27.7"}],"rank_id": "7","server_id": "10.155.111.140"}
]
}
],
"para_plane_nic_location": "device",
"para_plane_nic_name": ["eth0","eth1","eth2","eth3","eth4","eth5","eth6","eth7"],
"para_plane_nic_num": "8",
"status": "completed"
}
其中,以下参数需要根据实际训练环境修改:
MindSpore分布式并行训练的通信使用了华为集合通信库Huawei Collective Communication Library(以下简称HCCL),可以在Ascend AI处理器配套的软件包中找到。同时mindspore.communication.management中封装了HCCL提供的集合通信接口,方便用户配置分布式信息。HCCL实现了基于Ascend AI处理器的多机多卡通信,我们列出使用分布式服务常见的一些使用限制,详细的可以查看HCCL对应的使用文档。
下面是调用集合通信库样例代码:
import os
from mindspore import context
from mindspore.communication.management import init
if __name__ == "__main__":
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=int(os.environ["DEVICE_ID"]))
init()
...
其中,
分布式训练时,数据是以数据并行的方式导入的。下面的代码通过数据并行的方式加载了CIFAR-10数据集,代码中的data_path是指数据集的路径,即cifar-10-batches-bin文件夹的路径。
import mindspore.common.dtype as mstype
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as C
import mindspore.dataset.transforms.vision.c_transforms as vision
from mindspore.communication.management import get_rank, get_group_size
def create_dataset(data_path, repeat_num=1, batch_size=32, rank_id=0, rank_size=1):
resize_height = 224
resize_width = 224
rescale = 1.0 / 255.0
shift = 0.0
# get rank_id and rank_size
rank_id = get_rank()
rank_size = get_group_size()
data_set = ds.Cifar10Dataset(data_path, num_shards=rank_size, shard_id=rank_id)
# define map operations
random_crop_op = vision.RandomCrop((32, 32), (4, 4, 4, 4))
random_horizontal_op = vision.RandomHorizontalFlip()
resize_op = vision.Resize((resize_height, resize_width))
rescale_op = vision.Rescale(rescale, shift)
normalize_op = vision.Normalize((0.4465, 0.4822, 0.4914), (0.2010, 0.1994, 0.2023))
changeswap_op = vision.HWC2CHW()
type_cast_op = C.TypeCast(mstype.int32)
c_trans = [random_crop_op, random_horizontal_op]
c_trans += [resize_op, rescale_op, normalize_op, changeswap_op]
# apply map operations on images
data_set = data_set.map(input_columns="label", operations=type_cast_op)
data_set = data_set.map(input_columns="image", operations=c_trans)
# apply shuffle operations
data_set = data_set.shuffle(buffer_size=10)
# apply batch operations
data_set = data_set.batch(batch_size=batch_size, drop_remainder=True)
# apply repeat operations
data_set = data_set.repeat(repeat_num)
return data_set
与单机不同处,在数据集接口需要传入num_shards和shard_id参数,分别对应卡的数量和逻辑序号,建议通过HCCL接口获取:
数据并行及自动并行模式下,网络定义方式与单机一致。代码请参考ResNet50实现。
自动并行以算子为粒度切分模型,通过算法搜索得到最优并行策略,所以与单机训练不同的是,为了有更好的并行训练效果,损失函数建议使用小算子来实现。
在Loss部分,我们采用SoftmaxCrossEntropyWithLogits的展开形式,即按照数学公式,将其展开为多个小算子进行实现,样例代码如下:
from mindspore.ops import operations as P
from mindspore import Tensor
import mindspore.ops.functional as F
import mindspore.common.dtype as mstype
import mindspore.nn as nn
class SoftmaxCrossEntropyExpand(nn.Cell):
def __init__(self, sparse=False):
super(SoftmaxCrossEntropyExpand, self).__init__()
self.exp = P.Exp()
self.sum = P.ReduceSum(keep_dims=True)
self.onehot = P.OneHot()
self.on_value = Tensor(1.0, mstype.float32)
self.off_value = Tensor(0.0, mstype.float32)
self.div = P.Div()
self.log = P.Log()
self.sum_cross_entropy = P.ReduceSum(keep_dims=False)
self.mul = P.Mul()
self.mul2 = P.Mul()
self.mean = P.ReduceMean(keep_dims=False)
self.sparse = sparse
self.max = P.ReduceMax(keep_dims=True)
self.sub = P.Sub()
def construct(self, logit, label):
logit_max = self.max(logit, -1)
exp = self.exp(self.sub(logit, logit_max))
exp_sum = self.sum(exp, -1)
softmax_result = self.div(exp, exp_sum)
if self.sparse:
label = self.onehot(label, F.shape(logit)[1], self.on_value, self.off_value)
softmax_result_log = self.log(softmax_result)
loss = self.sum_cross_entropy((self.mul(softmax_result_log, label)), -1)
loss = self.mul2(F.scalar_to_array(-1.0), loss)
loss = self.mean(loss, -1)
return loss
定义优化器:采用Momentum优化器作为参数更新工具,这里定义与单机一致,不再展开,具体可以参考样例代码中的实现。
context.set_auto_parallel_context是配置并行训练参数的接口,必须在Model初始化前调用。如用户未指定参数,框架会自动根据并行模式为用户设置参数的经验值。如数据并行模式下,parameter_broadcast默认打开。主要参数包括:
如脚本中存在多个网络用例,请在执行下个用例前调用context.reset_auto_parallel_context将所有参数还原到默认值。在下面的样例中我们指定并行模式为自动并行,用户如需切换为数据并行模式,只需将parallel_mode改为DATA_PARALLEL。
from mindspore import context
from mindspore.nn.optim.momentum import Momentum
from mindspore.train.callback import LossMonitor
from mindspore.train.model import Model, ParallelMode
from resnet import resnet50
device_id = int(os.getenv('DEVICE_ID'))
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
context.set_context(device_id=device_id) # set device_id
def test_train_cifar(epoch_size=10):
context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, mirror_mean=True)
loss_cb = LossMonitor()
dataset = create_dataset(data_path, epoch_size)
batch_size = 32
num_classes = 10
net = resnet50(batch_size, num_classes)
loss = SoftmaxCrossEntropyExpand(sparse=True)
opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01, 0.9)
model = Model(net, loss_fn=loss, optimizer=opt)
model.train(epoch_size, dataset, callbacks=[loss_cb], dataset_sink_mode=True)
其中,
上述已将训练所需的脚本编辑好了,接下来通过命令调用对应的脚本。目前MindSpore分布式执行采用单卡单进程运行方式,即每张卡上运行1个进程,进程数量与使用的卡的数量一致。其中,0卡在前台执行,其他卡放在后台执行。每个进程创建1个目录,用来保存日志信息以及算子编译信息。下面以使用8张卡的分布式训练脚本为例,演示如何运行脚本:
#!/bin/bash
export DATA_PATH=${DATA_PATH:-$1}
export RANK_TABLE_FILE=$(pwd) /rank_table_8pcs.json
export RANK_SIZE=8
for((i=1;i<${RANK_SIZE};i++))
do
rm -rf device$i
mkdir device$i
cp ./resnet50_distributed_training.py ./resnet.py ./device$i
cd ./device$i
export DEVICE_ID=$i
export RANK_ID=$i
echo "start training for device $i"
env > env$i.log
pytest -s -v ./resnet50_distributed_training.py > train.log$i 2>&1 &
cd ../
done
rm -rf device0
mkdir device0
cp ./resnet50_distributed_training.py ./resnet.py ./device0
cd ./device0
export DEVICE_ID=0
export RANK_ID=0
echo "start training for device 0"
env > env0.log
pytest -s -v ./resnet50_distributed_training.py > train.log0 2>&1
if [ $? -eq 0 ];then
echo "training success"
else
echo "training failed"
exit 2
fi
cd ../
脚本需要传入数据集的路径变量DATA_PATH。其中需要设置的环境变量有,
启动脚本运行时间大约在5分钟内,主要时间是用于算子的编译,实际训练时间在20秒内,而单卡的脚本编译加上训练的时间在10分钟左右。device目录的train.log中运行的日志示例如下:
epoch: 1 step: 156, loss is 2.0084016
epoch: 2 step: 156, loss is 1.6407638
epoch: 3 step: 156, loss is 1.6164391
epoch: 4 step: 156, loss is 1.6838071
epoch: 5 step: 156, loss is 1.6320667
epoch: 6 step: 156, loss is 1.3098773
epoch: 7 step: 156, loss is 1.3515002
epoch: 8 step: 156, loss is 1.2943741
epoch: 9 step: 156, loss is 1.2316195
epoch: 10 step: 156, loss is 1.1533381
从上面Resnet-50的分布式并行训练的案例,我们不难看出,分布式并行相比单卡引入了一些开发上的复杂性:
同时,分布式训练下的调试调优也会更难些。虽然引入了这些开发调试的成本,但是分布式训练带来的性能收益可能远高于开发调试的开销,以Resnet-50为例,在昇腾芯片上8卡的性能超过单卡的X倍。因此对于训练要求较高的模型,使用分布式训练是更好的选择。
[1] 分布式训练完整样例代码