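YOLOv5 is one of the most accurate and fastest object detection frameworks currently available, and it is widely used in both industry and academia.
Whether training or running inference, we want the job to finish quickly, and with large datasets efficiency becomes critical. This post briefly walks through the training modes YOLOv5 supports, to help you understand how deep learning models are trained and to pick the most efficient mode for your own hardware.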
Official reference for the YOLOv5 training modes: https://github.com/ultralytics/yolov5/issues/475
Single-GPU training
python train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0
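Multi-GPU DP (DataParallel) training (not recommended)
The official guide does not recommend this mode either. It is barely faster than a single GPU. The data is spread across several cards, but the results are gathered on the primary card, so GPU memory usage on the primary card and on the other cards becomes unbalanced.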
python train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1
In the words of the official guide: "This method is slow and barely speeds up training compared to using just 1 GPU."
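Multi-GPU DDP (DistributedDataParallel) training (strongly recommended)
DDP training is launched through python -m torch.distributed.run with the --nproc_per_node option.
The command is: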
python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1
--nproc_per_node specifies how many GPUs you would like to use. In the example above, it is 2.
--batch is the total batch size. It is divided evenly across the GPUs. In the example above, it is 64/2=32 per GPU.
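The even split of --batch can be sketched in a few lines of Python. This is only an illustration: per_gpu_batch is a hypothetical helper, not part of YOLOv5, and rejecting a total that does not divide evenly is an assumption of this sketch.

```python
def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    """Split a total batch size evenly across GPUs (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if total_batch % num_gpus != 0:
        # Assumption of this sketch: an uneven split is rejected, not rounded.
        raise ValueError(f"--batch {total_batch} is not a multiple of {num_gpus} GPUs")
    return total_batch // num_gpus


print(per_gpu_batch(64, 2))  # 32, matching the 64/2=32 example above
```

When choosing --batch for DDP, it is therefore simplest to pick a multiple of the number of GPUs.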
Example (training from scratch on GPUs 2 and 3):
python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --device 2,3
Using SyncBatchNorm with DDP
SyncBatchNorm can improve accuracy in multi-GPU training, but it slows training down significantly. It is only available for multi-GPU DistributedDataParallel training, and it works best when the batch size on each GPU is small (<= 8).
To use SyncBatchNorm, simply pass the --sync-bn flag on the command line, as below:
python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --sync-bn
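Why synchronization matters for small per-GPU batches can be illustrated without PyTorch. The sketch below is plain Python, not the PyTorch implementation: the activation values are made up, and batch_stats is a hypothetical helper that computes the mean and variance a batch normalization layer would normalize with.

```python
from statistics import fmean, pvariance


def batch_stats(values):
    """Mean and population variance of one channel over a batch."""
    return fmean(values), pvariance(values)


# Hypothetical activations of one channel over a global batch of 8 samples.
global_batch = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]

# Without synchronization, each of the 2 GPUs only sees its own 4 samples.
shards = [global_batch[:4], global_batch[4:]]
per_gpu = [batch_stats(shard) for shard in shards]

# With synchronization, every GPU normalizes with the global statistics.
synced = batch_stats(global_batch)

print(per_gpu)  # [(1.5, 1.25), (11.5, 1.25)]
print(synced)   # (6.5, 26.25)
```

In this toy case the two GPUs would normalize with very different means, and neither sees the true spread of the batch. The smaller the per-GPU shard, the noisier these local statistics tend to be, which is consistent with the advice to use SyncBatchNorm when the per-GPU batch size is small.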
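Multi-machine training
Training across several machines depends on network communication between them, which costs some efficiency. The official command for multi-machine training is: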
# On machine R
python -m torch.distributed.run --nproc_per_node G --nnodes N --node_rank R --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''
Here G is the number of GPUs per machine, N is the number of machines, and R is the rank of the machine the command runs on, from 0 to N-1. Machine 0 is the master machine, and --master_addr and --master_port tell every machine where to reach it.
For example, with two machines that have two GPUs each, G = 2 and N = 2, and the command above is run with R = 1 on the second machine.
Training will not start until all N machines are connected. Output is shown only on the master machine.
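To make the roles of G, N and R concrete, the command line for each machine can be generated programmatically. This is a minimal sketch: launch_command is a hypothetical helper, and it only uses the flags that appear in the official command above.

```python
import shlex


def launch_command(gpus_per_node, num_nodes, node_rank, master_addr, master_port, train_args):
    """Build the torch.distributed.run command for one machine (hypothetical helper)."""
    if not 0 <= node_rank < num_nodes:
        raise ValueError("node_rank must be between 0 and num_nodes - 1")
    return [
        "python", "-m", "torch.distributed.run",
        "--nproc_per_node", str(gpus_per_node),
        "--nnodes", str(num_nodes),
        "--node_rank", str(node_rank),
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        "train.py", *train_args,
    ]


train_args = ["--batch", "64", "--data", "coco.yaml", "--cfg", "yolov5s.yaml", "--weights", ""]

# Two machines with two GPUs each: G = 2, N = 2, and R takes the values 0 and 1.
for node_rank in range(2):
    cmd = launch_command(2, 2, node_rank, "192.168.1.1", 1234, train_args)
    print(f"# On machine {node_rank}")
    print(shlex.join(cmd))
```

Only --node_rank differs between machines. The total number of worker processes is G * N, here 4.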
Code for DP mode:
# DP mode
if cuda and RANK == -1 and torch.cuda.device_count() > 1:
    LOGGER.warning(
        'WARNING ⚠️ DP not recommended, use torch.distributed.run for best DDP Multi-GPU results.\n'
        'See Multi-GPU Tutorial at https://docs.ultralytics.com/yolov5/tutorials/multi_gpu_training to get started.'
    )
    model = torch.nn.DataParallel(model)
Whether SyncBatchNorm is used:
# SyncBatchNorm
if opt.sync_bn and cuda and RANK != -1:
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
    LOGGER.info('Using SyncBatchNorm()')
Code for DDP mode:
# DDP mode
if cuda and RANK != -1:
    model = smart_DDP(model)
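Taken together, the three conditions above decide which mode a run ends up in. The sketch below mirrors them in plain Python so they can be tried without a GPU. select_mode and uses_sync_bn are hypothetical helpers, and reading the rank from the RANK environment variable with a default of -1 is an assumption of this sketch (torch.distributed.run sets RANK for each worker it starts).

```python
import os


def select_mode(cuda: bool, rank: int, device_count: int) -> str:
    """Mirror the DDP and DP conditions of the excerpts above (hypothetical helper)."""
    if cuda and rank != -1:
        return "DDP"
    if cuda and rank == -1 and device_count > 1:
        return "DP"
    return "single-device"


def uses_sync_bn(sync_bn_flag: bool, cuda: bool, rank: int) -> bool:
    """SyncBatchNorm is only applied when --sync-bn is passed in a DDP run."""
    return sync_bn_flag and cuda and rank != -1


# Assumption: the rank is read from the environment, -1 when the variable is unset.
rank = int(os.getenv("RANK", -1))
print(select_mode(cuda=True, rank=rank, device_count=2))
print(uses_sync_bn(sync_bn_flag=True, cuda=True, rank=rank))
```

Started with plain python on a machine with two visible GPUs, the rank stays at -1 and the run falls into DP mode, in which --sync-bn has no effect. Started through torch.distributed.run, each worker has a rank and takes the DDP path.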
The smart_DDP function it calls:
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# check_version and LOCAL_RANK are defined elsewhere in the YOLOv5 code base
def smart_DDP(model):
    # Model DDP creation with checks
    assert not check_version(torch.__version__, '1.12.0', pinned=True), \
        'torch==1.12.0 torchvision==0.13.0 DDP training is not supported due to a known issue. ' \
        'Please upgrade or downgrade torch to use DDP. See https://github.com/ultralytics/yolov5/issues/8395'
    if check_version(torch.__version__, '1.11.0'):
        return DDP(model, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK, static_graph=True)
    else:
        return DDP(model, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK)
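As the code shows, DDP training is not supported with torch==1.12.0 and torchvision==0.13.0. For torch 1.11.0 or later, DDP is created with the extra static_graph=True argument, while older versions omit it.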
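The version logic can be checked in isolation with a small sketch. It is not the YOLOv5 check_version helper: parse_release and ddp_extra_kwargs are hypothetical names, and comparing only the three leading numeric components (ignoring suffixes such as +cu113) is a simplification that may differ from the real helper.

```python
import re


def parse_release(version: str) -> tuple:
    """Extract (major, minor, patch) from a version string such as '1.12.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def ddp_extra_kwargs(torch_version: str) -> dict:
    """Return the extra DDP keyword arguments for a torch version (hypothetical helper)."""
    release = parse_release(torch_version)
    if release == (1, 12, 0):
        # Pinned check: exactly this release is rejected.
        raise RuntimeError("DDP training is not supported with torch==1.12.0")
    if release >= (1, 11, 0):
        # Minimum check: 1.11.0 and everything newer.
        return {"static_graph": True}
    return {}


for version in ("1.10.2", "1.11.0", "1.13.1"):
    print(version, ddp_extra_kwargs(version))
```

The first check is an exact match and the second is a minimum, so torch 1.12.1 or 1.13.1 pass the assertion and also get static_graph=True.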