GPipe-style pipelining has two problems: low hardware utilization and high memory consumption. The workers process only one minibatch at a time, so only one minibatch is active in the system, which severely limits hardware utilization. And if a batch is split into n micro-batches, each stage has to cache n copies of activations.
The 1F1B strategy solves the activation-caching problem: the number of cached activations depends only on the number of stages, which further saves GPU memory and makes it possible to train larger models.
A forward activation can only be released after its corresponding backward pass has finished (whether or not activation checkpointing is used). So under pipeline parallelism, to cache as few activations as possible, each activation should be held for as short a time as possible, i.e. released as early as possible; this means each micro-batch's backward pass should complete as early as possible. Backward passes therefore get higher priority: the backward of a lower-numbered micro-batch runs before the forward of a higher-numbered one. Concretely, if the last stage runs a micro-batch's backward pass immediately after finishing its forward pass, every earlier stage can also start its backward passes as early as possible. This is the 1F1B strategy.
Under the 1F1B (one-forward-one-backward) schedule, each worker alternates between the forward and backward computation of micro-batches, and each micro-batch's backward pass is routed to the same worker that ran its forward pass.
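To make this ordering concrete, here is a minimal standalone sketch (not DeepSpeed code; the function name and layout are illustrative only) that prints the per-stage command order under 1F1B, for 2 stages and 4 micro-batches:
def one_f_one_b(stage, num_stages, num_microbatches):
    """Per-stage command order under a 1F1B (PipeDream-Flush style) schedule."""
    order = []
    # Warmup: earlier stages issue a few forwards before their first backward.
    warmup = min(num_stages - stage - 1, num_microbatches)
    fwd = bwd = 0
    for _ in range(warmup):
        order.append(('F', fwd)); fwd += 1
    # Steady state: strictly alternate one forward with one backward.
    while fwd < num_microbatches:
        order.append(('F', fwd)); fwd += 1
        order.append(('B', bwd)); bwd += 1
    # Cooldown: drain the remaining backwards.
    while bwd < num_microbatches:
        order.append(('B', bwd)); bwd += 1
    return order

for s in range(2):
    print(f'stage {s}:', one_f_one_b(s, num_stages=2, num_microbatches=4))
# stage 0: F0, F1 B0, F2 B1, F3 B2, B3  -> at most 2 activations in flight
# stage 1: F0 B0, F1 B1, F2 B2, F3 B3   -> at most 1 activation in flight
Each stage keeps at most (num_stages - stage index) activations alive at once, independent of the number of micro-batches.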
This post uses AlexNet as the example and analyzes how these four steps are implemented in the DeepSpeed source code; the implementation differs slightly from PipeDream's.
# DeepSpeedExamples/pipeline_parallelism/ds_config.json
{
"train_batch_size" : 256,
"train_micro_batch_size_per_gpu" : 128,
...
...
}
When launching deepspeed, the -p hyperparameter sets the pipeline-parallel degree; if the number of micro-batches equals the pipeline-parallel degree, this is treated here as the best-practice configuration.
# DeepSpeedExamples/pipeline_parallelism/run.sh
deepspeed train.py --deepspeed_config=ds_config.json -p 2 --steps=1
This quickly runs a distributed AlexNet with pipeline-parallel degree 2. If only 2 CUDA devices are made visible, the data-parallel degree becomes 1 (i.e. no data parallelism), because the only two ranks are both used for pipeline parallelism.
Quick Start:
export CUDA_VISIBLE_DEVICES=0,1
sh run.sh
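Before diving into the code, a quick sanity check of the batch-size bookkeeping for this run (plain arithmetic, using DeepSpeed's relation train_batch_size = micro_batch_size_per_gpu × gradient-accumulation steps × data-parallel degree):
train_batch_size = 256          # ds_config.json
micro_batch_size_per_gpu = 128  # ds_config.json
world_size = 2                  # CUDA_VISIBLE_DEVICES=0,1
pipeline_parallel = 2           # -p 2

data_parallel = world_size // pipeline_parallel   # 1 -> no data parallelism
micro_batches = train_batch_size // (micro_batch_size_per_gpu * data_parallel)
print(data_parallel, micro_batches)               # 1 2 -> micro batch num == pp num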
With pp=2, the processes for RANK 0 and RANK 1 are launched simultaneously; in other words, the main function below is executed by both RANK 0 and RANK 1. Calling os.getenv('RANK') tells you which rank the current process is running as.
if __name__ == '__main__':
    # __main__ runs once per rank: the launcher starts one process per GPU,
    # so with pp=2 this block executes on both RANK 0 and RANK 1.
    args = get_args()
    deepspeed.init_distributed(dist_backend=args.backend)
    args.local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(args.local_rank)
    if args.pipeline_parallel_size == 0:
        train_base(args)
    else:
        train_pipe(args)
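For reference, each launched process can inspect the rank-related environment variables set by the deepspeed launcher (the values shown are what this 2-GPU run produces):
import os

# Every rank executes the same script; the launcher distinguishes them via env vars.
print(os.getenv('RANK'), os.getenv('LOCAL_RANK'), os.getenv('WORLD_SIZE'))
# RANK 0 process prints: 0 0 2
# RANK 1 process prints: 1 1 2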
1.1. deepspeed.init_distributed(dist_backend=args.backend)
This mainly initializes the communication backend; for background on the underlying communication methods, see https://zhuanlan.zhihu.com/p/465967735 and https://zhuanlan.zhihu.com/p/79030485.
Create a torch backend object, initialize torch distributed, and assign to cdb
# miniconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py
# Main DeepSpeed Comms. public API.
def init_distributed(dist_backend="nccl",
                     auto_mpi_discovery=True,
                     distributed_port=TORCH_DISTRIBUTED_DEFAULT_PORT,
                     verbose=True,
                     timeout=default_pg_timeout,
                     init_method=None,
                     dist_init_required=None,
                     config=None):
1.2 train_pipe(args)
def train_pipe(args, part='parameters'):
    ...
    ...
    net = AlexNet(num_classes=10)

    net = PipelineModule(layers=join_layers(net),
                         loss_fn=torch.nn.CrossEntropyLoss(),
                         num_stages=args.pipeline_parallel_size,
                         partition_method=part,
                         activation_checkpoint_interval=0)

    trainset = cifar_trainset(args.local_rank)

    engine, _, _, _ = deepspeed.initialize(
        args=args,
        model=net,
        model_parameters=[p for p in net.parameters() if p.requires_grad],
        training_data=trainset)

    for step in range(args.steps):
        loss = engine.train_batch()
1.2.0 layers=join_layers(net)
join_layers() flattens the model's visually intuitive structure (features, avgpool, classifier, in that order) into a single sequential list of layers, which is then passed to PipelineModule.
def join_layers(vision_model):
    layers = [
        *vision_model.features,
        vision_model.avgpool,
        lambda x: torch.flatten(x, 1),
        *vision_model.classifier,
    ]
    return layers
layers:
[Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2)), ReLU(inplace=True), MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False), Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2)), ReLU(inplace=True), MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False), Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False), AdaptiveAvgPool2d(output_size=(6, 6)), <function join_layers.<locals>.<lambda> at 0x7fc8a6193550>, Dropout(p=0.5, inplace=False), Linear(in_features=9216, out_features=4096, bias=True), ReLU(inplace=True), Dropout(p=0.5, inplace=False), Linear(in_features=4096, out_features=4096, bias=True), ReLU(inplace=True), Linear(in_features=4096, out_features=10, bias=True)]
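As a quick sanity check (reusing AlexNet and join_layers from the example's train.py), the flattened list has 22 entries, which matches the 22-element parameter-count array and the partition indices seen below:
net = AlexNet(num_classes=10)
layers = join_layers(net)
# 13 feature layers + avgpool + flatten lambda + 7 classifier layers
print(len(layers))   # 22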
1.2.1 PipelineModule(layers=join_layers(net),…)
Setup world info:
# dist.new_group() puts all ranks into a single process group
self.world_group = dist.new_group(ranks=range(dist.get_world_size()))
self.global_rank = dist.get_rank(group=self.world_group)
self.world_size = dist.get_world_size(group=self.world_group)
self.local_rank = int(os.environ.get("LOCAL_RANK", None))
world_group=<torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7fc8a61af7f0> global_rank =1 world_size=2 local_rank=1
world_group=<torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7f0e9c8af970> global_rank =0 world_size=2 local_rank=0
Initialize partition information
#
self._layer_specs = list(layers)
self._num_layers = len(self._layer_specs)
self._local_start = 0
self._local_stop = None
self._partition_layers(method=partition_method)
With the simple 'parameters' partition method, the _count_layer_params() helper iterates over the layers and counts the trainable parameters in each layer of the model.
1.2.1.1 _partition_layers()
elif method == 'parameters':
    param_counts = self._count_layer_params()
    para_len = len(param_counts)
    ...
# count the trainable parameters in each layer
def _count_layer_params(self):
    """Count the trainable parameters in individual layers.

    This routine will only build one layer at a time.

    Returns:
        A list of the number of parameters in each layer.
    """
    param_counts = [0] * len(self._layer_specs)
    for idx, layer in enumerate(self._layer_specs):
        if isinstance(layer, LayerSpec):
            l = layer.build()
            params = filter(lambda p: p.requires_grad, l.parameters())
            param_counts[idx] = sum(p.numel() for p in params)
        elif isinstance(layer, nn.Module):
            params = filter(lambda p: p.requires_grad, layer.parameters())
            param_counts[idx] = sum(p.numel() for p in params)
    return param_counts
This yields an array mapping each layer to its parameter count:
[23296, 0, 0, 307392, 0, 0, 663936, 0, 884992, 0, 590080, 0, 0, 0, 0, 0, 37752832, 0, 0, 16781312, 0, 40970]
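Two of the entries can be cross-checked against the layer shapes with simple arithmetic (not DeepSpeed code); layers without trainable parameters (ReLU, MaxPool2d, Dropout, the flatten lambda) contribute 0:
# Entry 0: Conv2d(3, 64, kernel_size=(11, 11)) -> 3*64*11*11 weights + 64 biases
print(3 * 64 * 11 * 11 + 64)    # 23296, matches param_counts[0]
# Entry 16: Linear(in_features=9216, out_features=4096)
print(9216 * 4096 + 4096)       # 37752832, matches param_counts[16]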
Based on each layer's parameter count, the partition_balanced() method then performs a simple stage split that balances the load across the GPUs:
...
self.parts = ds_utils.partition_balanced(weights=param_counts,
                                         num_parts=num_stages)
# compute partition: miniconda3/lib/python3.9/site-packages/deepspeed/runtime/utils.py
def partition_balanced(weights, num_parts, eps=1e-3):
    num_items = len(weights)
    # First check for the trivial edge case
    if num_items <= num_parts:
        return partition_uniform(num_items, num_parts)

    weights_ = prefix_sum_inc(weights)

    # Find the smallest bottleneck (weight of heaviest partition)
    bottleneck = _rb_partition_balanced(weights_, num_parts, eps=eps)

    # Now compute that partitioning
    parts, success = _lprobe(weights_, num_parts, bottleneck)
    print(f':::::::::::::::;part::::::::::::::{parts}::::::::::success:::::::{success}')  # debug print added for this walkthrough
    assert success
    return parts
This returns a global array of split indices. For example, the array below means stage 0 builds layers [0, 19) (layers 0–18) and stage 1 builds layers [19, 22) (layers 19–21):
self.parts = [0, 19, 22]
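A quick check of this split (plain arithmetic on the array above) shows how the parameter weight ends up distributed; the single Linear(9216, 4096) layer is so large that the split cannot be made perfectly even:
param_counts = [23296, 0, 0, 307392, 0, 0, 663936, 0, 884992, 0, 590080, 0,
                0, 0, 0, 0, 37752832, 0, 0, 16781312, 0, 40970]
parts = [0, 19, 22]
for stage in range(len(parts) - 1):
    lo, hi = parts[stage], parts[stage + 1]
    print(f'stage {stage}: layers {lo}..{hi - 1}, {sum(param_counts[lo:hi])} parameters')
# stage 0: layers 0..18, 40222528 parameters
# stage 1: layers 19..21, 16822282 parameters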
The _set_bounds() method is then called with this rank's slice boundaries (looked up by stage_id, i.e. the rank's stage), so each GPU ends up holding its own private start/stop indices, which is what realizes the partition.
...
self._set_bounds(start=self.parts[stage_id], stop=self.parts[stage_id + 1])
# _set_bounds()
def _set_bounds(self, start=None, stop=None):
    """Manually define the range of layers that will be built on this process.

    These boundaries are treated as list slices and so start is inclusive and stop is
    exclusive. The default of None for both results in all layers being built
    locally.
    """
    self._local_start = start
    self._local_stop = stop
    print(f'::::::::::::_local_start:;:::{self._local_start}')   # debug prints added for this walkthrough
    print(f':::::::::::_local_stop:::::::{self._local_stop}')
...
self.forward_funcs = []
self.fwd_map = {}
self.tied_modules = nn.ModuleDict()
self.tied_weight_attrs = {}
self._build()
RANK 0: _local_start = 0, _local_stop = 19
RANK 1: _local_start = 19, _local_stop = 22
::RANk0:::::::::::forward_funcs:::::[Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2)), ReLU(inplace=True), MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False), Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2)), ReLU(inplace=True), MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False), Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False), AdaptiveAvgPool2d(output_size=(6, 6)), <function join_layers.<locals>.<lambda> at 0x7f6308dde550>, Dropout(p=0.5, inplace=False), Linear(in_features=9216, out_features=4096, bias=True), ReLU(inplace=True), Dropout(p=0.5, inplace=False)]::::::::;fwd_map:::::{'0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7, '8': 8, '9': 9, '10': 10, '11': 11, '12': 12, '13': 13, '15': 15, '16': 16, '17': 17, '18': 18}:::::::tied_modules:::::ModuleDict():::::::
::RANk1:::::::::::forward_funcs:::::[Linear(in_features=4096, out_features=4096, bias=True), ReLU(inplace=True), Linear(in_features=4096, out_features=10, bias=True)]::::::::;fwd_map:::::{'19': 0, '20': 1, '21': 2}:::::::tied_modules:::::ModuleDict():::::::
How is the partitioning actually achieved?
The program launches two RANK processes right at the start, and each RANK executes the same global code. When a RANK reaches _set_bounds(), the stage_id it passes in comes from get_coord(global_rank), so different RANKs obtain different stage_ids at this point and therefore store different values of self._local_start and self._local_stop.
The _build() function then uses each RANK's own _local_start and _local_stop to map the corresponding module forward functions, producing the RANK-private self.forward_funcs and self.fwd_map. From that point on, each GPU's module holds a different subset of the layers, i.e. each card already contains a different part of the model structure; when the engine is initialized later, the model passed in is therefore already partitioned.
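A simplified, standalone sketch of the _build() logic (the helper name and return values here are hypothetical; the real method lives in deepspeed/runtime/pipe/module.py):
import torch.nn as nn

def build_local_partition(layer_specs, local_start, local_stop):
    """Sketch of what PipelineModule._build() does on one rank (simplified)."""
    forward_funcs, fwd_map, modules = [], {}, nn.ModuleDict()
    for local_idx, layer in enumerate(layer_specs[local_start:local_stop]):
        layer_idx = local_idx + local_start
        forward_funcs.append(layer)          # every callable takes part in forward()
        if isinstance(layer, nn.Module):
            # Only real nn.Modules are registered, so only their parameters
            # are allocated on this rank.
            name = str(layer_idx)
            fwd_map[name] = len(forward_funcs) - 1
            modules[name] = layer
        # Plain callables (e.g. the flatten lambda at global index 14) stay in
        # forward_funcs but get no fwd_map entry, which is why '14' is missing
        # from RANK 0's fwd_map in the log above.
    return forward_funcs, fwd_map, modules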
1.2.2 trainset = cifar_trainset(args.local_rank)
...
# Ensure only one rank downloads the dataset.
dist.barrier()
if local_rank != 0:
    dist.barrier()
trainset = torchvision.datasets.CIFAR10(root=dl_path,
                                        train=True,
                                        download=True,
                                        transform=transform)
if local_rank == 0:
    dist.barrier()
return trainset
During distributed training, PyTorch typically lets the main process download/prepare the data and cache it, while the other processes read from that cache; the synchronization between processes is implemented with torch.distributed.barrier().
In the code above, if the process executing cifar_trainset() is not the main process (rank != 0), it hits torch.distributed.barrier() and blocks there, waiting for all processes (including the main process, once it has finished preparing the data) to reach the barrier. If the process executing cifar_trainset() is the main process, it downloads and processes the data directly, and only afterwards reaches torch.distributed.barrier(); at that point all processes have arrived at the barrier, so they are synchronized and released together.
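The same two-barrier idiom works for any one-time setup that only rank 0 should perform; a minimal sketch (run_once_on_rank0 is a hypothetical helper, not part of DeepSpeedExamples):
import torch.distributed as dist

def run_once_on_rank0(local_rank, setup_fn):
    if local_rank != 0:
        dist.barrier()      # non-zero ranks wait until rank 0 has finished the setup
    result = setup_fn()     # rank 0 runs it first; the others then read the cached result
    if local_rank == 0:
        dist.barrier()      # rank 0 releases the waiting ranks
    return result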
1.2.3 deepspeed.initialize(…)
engine = PipelineEngine(...)  # deepspeed.initialize() returns a PipelineEngine when the model is a PipelineModule
1.2.4 loss = engine.train_batch()
def train_batch(self, data_iter=None):
    self.module.train()   # put the PipelineModule (and all its sub-modules) into training mode
    self.timers('train_batch').start()

    sched = schedule.TrainSchedule(micro_batches=self.micro_batches,
                                   stages=self.num_stages,
                                   stage_id=self.stage_id)
    self._exec_schedule(sched)
    self.agg_train_loss = self._aggregate_total_loss()
1.2.4.1 schedule.TrainSchedule(micro_batches=self.micro_batches…)
When the runtime executes the schedule, it automatically iterates over the schedule's steps() generator:
# miniconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/schedule.py
class TrainSchedule(PipeSchedule):
    def steps(self):
        ...
        ...
        yield cmds
cmds result:
[RecvActivation(buffer_id=0), LoadMicroBatch(buffer_id=0), ForwardPass(buffer_id=0)]
The accompanying schedule diagram shows the physical runtime layout: two stages running pipeline-parallel, each executing 6 steps; within a stage the steps execute serially, while the two stages execute in parallel with each other.
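The full command sequence can also be inspected offline by instantiating the schedule directly (a small experiment outside the engine; micro_batches=2 and stages=2 match this run):
from deepspeed.runtime.pipe import schedule

for stage_id in range(2):
    sched = schedule.TrainSchedule(micro_batches=2, stages=2, stage_id=stage_id)
    print(f'stage {stage_id}:')
    for step_id, cmds in enumerate(sched.steps()):
        print(f'  step {step_id}: {cmds}')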