PyTorch Single-Machine Multi-GPU Training with DDP

Multi-GPU Training Approaches

1. DP: torch.nn.DataParallel, the simple route (a minimal sketch follows this list)
2. DDP: torch.nn.parallel.DistributedDataParallel. Put simply, training on 4 GPUs starts 4 processes running in parallel, and each GPU handles 1/4 of the total batch.
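For reference, a minimal DataParallel sketch; model, criterion, x, and y are placeholders for your own network, loss, inputs, and labels:

# DP: a single process drives all visible GPUs; each forward pass scatters
# the batch across the GPUs and gathers the outputs back on GPU 0
model = torch.nn.DataParallel(model).cuda()
output = model(x.cuda())            # x is split along dim 0 across the GPUs
loss = criterion(output, y.cuda())  # outputs are gathered on GPU 0
loss.backward()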

Comparing the Two Methods

Method 1 is simple, but it has shortcomings in training. Method 2 requires more code changes, but it is faster. Also, when the model was large I hit a problem with DataParallel: it raised an error saying the model parameters were not on one device. Most likely a single GPU could not hold all the parameters, though I am not sure of the exact cause; after switching to DDP mode it ran normally.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Using DDP

The key code is below; adapt it to your own needs.
Code explanation
args.local_rank: the device used by the current process. I use 4 GPUs here, so torch.distributed.launch spawns 4 processes and passes each one its own local_rank (0 to 3).
world_size: the total number of processes. Usually one process per GPU, so it can be read as the number of GPUs; with 4 GPUs here, 4 processes are started.
train_sampler: replace shuffle=True in the DataLoader with sampler=train_sampler.
dist.barrier(): a synchronization barrier; training proceeds only after all processes have reached this point.
train_sampler.set_epoch(epoch): reshuffles the data split each epoch so that every process gets different data.
--nproc_per_node=4: an argument for torch.distributed.launch telling it to start 4 processes; it must match world_size.

import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data.distributed import DistributedSampler


parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0, help='local device id on current node')
parser.add_argument('--world_size', type=int, default=4, help='n_gpus')
parser.add_argument('--init_method', default='tcp://127.0.0.1:2345', help='init-method')
args = parser.parse_args()

device = torch.device('cuda', args.local_rank)
torch.cuda.set_device(device)  # bind this process to its own GPU before init
dist.init_process_group(backend='nccl', init_method=args.init_method,
                        rank=args.local_rank, world_size=args.world_size)
# model is your own nn.Module, created elsewhere
model = DistributedDataParallel(model.cuda(args.local_rank),
                                device_ids=[args.local_rank],
                                output_device=args.local_rank)

train_sampler = DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=args.batch_size,
                                           drop_last=True,
                                           sampler=train_sampler)  # use sampler instead of shuffle=True

for epoch in range(1, args.epochs + 1):
    dist.barrier()
    train_sampler.set_epoch(epoch)  # call once per epoch, before iterating the loader
    ...
    for batch_idx, batch in enumerate(train_loader):
        ...

Running in the background

CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python -u -m torch.distributed.launch --nproc_per_node=4 train.py >train.log 2>&1 &
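Here -u turns off Python's output buffering so logs appear in train.log immediately, 2>&1 sends stderr to the same file, and nohup with the trailing & keeps the job alive after you log out.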

Running in the foreground

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train.py

Saving the model
Add a rank check, otherwise all 4 processes will each save the model.

if args.local_rank == 0:
    save...
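A common pattern, sketched below with a placeholder checkpoint path, is to save model.module.state_dict() instead; model.module is the raw network inside the DDP wrapper, so the saved keys carry no 'module.' prefix (see the next section):

if args.local_rank == 0:
    # model.module is the unwrapped network inside DDP
    torch.save(model.module.state_dict(), 'checkpoint.pth')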

Loading the model
Note: the state dict from distributed training has an extra 'module' in its keys, e.g. 'conv.weight' becomes 'module.conv.weight'. So before loading the weights, give the fresh model's keys the 'module' prefix too so the dictionaries match; wrapping the model in DataParallel does exactly that. You could instead edit the trained state dict, but that is more tedious.

model = torch.nn.DataParallel(model)
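For completeness, a sketch of the more tedious route of editing the trained state dict, assuming a placeholder checkpoint path:

state_dict = torch.load('checkpoint.pth', map_location='cpu')
# drop the leading 'module.' that the DDP/DataParallel wrapper adds to every key
state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
              for k, v in state_dict.items()}
model.load_state_dict(state_dict)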

Note: in DDP mode, BatchNorm can run into an in-place operation that prevents gradients from backpropagating; nn.BatchNormXd() can be replaced with nn.SyncBatchNorm().

(Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:102.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
...

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
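One way to make that swap is PyTorch's built-in converter, applied before wrapping the model in DDP (this reuses args from the snippet above):

# recursively swaps every nn.BatchNorm*d layer for nn.SyncBatchNorm
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DistributedDataParallel(model.cuda(args.local_rank),
                                device_ids=[args.local_rank],
                                output_device=args.local_rank)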

If you have to keep regular BatchNorm, set broadcast_buffers=False on the DDP model, as suggested in https://github.com/pytorch/pytorch/issues/22095

model = torch.nn.parallel.DistributedDataParallel(model.cuda(args.local_rank),
                                                  device_ids=[args.local_rank],
                                                  output_device=args.local_rank,
                                                  broadcast_buffers=False)
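With broadcast_buffers=False, DDP stops broadcasting rank 0's buffers (the BatchNorm running mean and variance) to the other processes at every forward pass; each process keeps its own running statistics, so the in-place buffer update no longer trips the autograd check.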
