1. When training YOLOv8 you may run into problems that interrupt training. Because YOLOv8's code structure differs somewhat from YOLOv5's, resuming training also works differently than in YOLOv5.
The first option is the official resume procedure in yolo command format. This requires that the yolo and ultralytics packages are both installed in your environment. The following resumes YOLOv8 training from a checkpoint:
yolo task=detect mode=train model=runs/detect/exp/weights/last.pt data=ultralytics/datasets/mydata.yaml epochs=100 save=True resume=True
The second case is when those two packages are not installed, or you do not use them for training and instead train with a Python script. Setting resume to True in ultralytics/yolo/cfg/default.yaml turns out to have no effect; to resume from a script, the change would be:
from ultralytics import YOLO

model = YOLO('runs/detect/exp/weights/last.pt')
results = model.train(data="ultralytics/datasets/mydata.yaml", epochs=100, device='0',
                      batch=8, save=True, resume=True)  # resume training from the checkpoint
This has never worked for me; my personal feeling is that it is a bug in YOLOv8 itself.
2. Modifying the code to resume training
2.1 In ultralytics/yolo/engine/trainer.py, find check_resume and resume_training.
Comment out resume = self.args.resume in check_resume and replace it with the path of the last.pt you want to resume from.
In resume_training, add one line that sets ckpt:
def check_resume(self):
    # resume = self.args.resume
    resume = 'runs/detect/exp/weights/last.pt'
    if resume:
        try:
            last = Path(
                check_file(resume) if isinstance(resume, (str, Path)) and Path(resume).exists() else get_latest_run())
            self.args = get_cfg(attempt_load_weights(last).args)
            self.args.model, resume = str(last), True  # reinstate
        except Exception as e:
            raise FileNotFoundError("Resume checkpoint not found. Please pass a valid checkpoint to resume from, "
                                    "i.e. 'yolo train resume model=path/to/last.pt'") from e
    self.resume = resume
def resume_training(self, ckpt):
    ckpt = torch.load('runs/detect/exp/weights/last.pt')
    if ckpt is None:
        return
    best_fitness = 0.0
    start_epoch = ckpt['epoch'] + 1
    if ckpt['optimizer'] is not None:
        self.optimizer.load_state_dict(ckpt['optimizer'])  # optimizer
        best_fitness = ckpt['best_fitness']
    if self.ema and ckpt.get('ema'):
        self.ema.ema.load_state_dict(ckpt['ema'].float().state_dict())  # EMA
        self.ema.updates = ckpt['updates']
    if self.resume:
        assert start_epoch > 0, \
            f'{self.args.model} training to {self.epochs} epochs is finished, nothing to resume.\n' \
            f"Start a new training without --resume, i.e. 'yolo task=... mode=train model={self.args.model}'"
        LOGGER.info(
            f'Resuming training from {self.args.model} from epoch {start_epoch + 1} to {self.epochs} total epochs')
    if self.epochs < start_epoch:
        LOGGER.info(
            f"{self.model} has been trained for {ckpt['epoch']} epochs. Fine-tuning for {self.epochs} more epochs.")
        self.epochs += ckpt['epoch']  # finetune additional epochs
    self.best_fitness = best_fitness
    self.start_epoch = start_epoch
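The checkpoint bookkeeping above is easy to get wrong, so here is a minimal pure-Python sketch of the arithmetic resume_training performs. A plain dict stands in for the real torch checkpoint (the key names 'epoch' and 'best_fitness' match the code above; the numbers are made up for illustration):

```python
# Stand-in for torch.load('runs/detect/exp/weights/last.pt'):
# a real YOLOv8 checkpoint stores (among others) these keys.
ckpt = {'epoch': 95, 'best_fitness': 0.793}

total_epochs = 200                   # self.epochs, after any manual override

# Same arithmetic as resume_training above:
start_epoch = ckpt['epoch'] + 1      # resume at the next epoch (0-indexed)
best_fitness = ckpt['best_fitness']

if total_epochs < start_epoch:       # requested total is already exceeded:
    total_epochs += ckpt['epoch']    # treat it as extra fine-tuning epochs

print(f'Resuming from epoch {start_epoch + 1} to {total_epochs} total epochs')
# → Resuming from epoch 97 to 200 total epochs
```

This is why, in section 3 below, lowering or raising self.epochs is enough to shorten or extend a resumed run: start_epoch always comes from the checkpoint, while the total comes from self.epochs.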
PS: some blog posts say you also need to comment out self.trainer.model = self.model in ultralytics/yolo/engine/model.py. That is not necessary. The original model.py does not contain that line; it has the three lines shown commented out below instead. The reason for the change is that, when running the training script, leaving the if not overrides.get('resume') branch uncommented makes the terminal print the network structure twice.
As follows:
def train(self, **kwargs):
    """
    Trains the model on a given dataset.

    Args:
        **kwargs (Any): Any number of arguments representing the training configuration.
    """
    self._check_is_pytorch_model()
    if self.session:  # Ultralytics HUB session
        if any(kwargs):
            LOGGER.warning('WARNING ⚠️ using HUB training arguments, ignoring local training arguments.')
        kwargs = self.session.train_args
    check_pip_update_available()
    overrides = self.overrides.copy()
    overrides.update(kwargs)
    if kwargs.get('cfg'):
        LOGGER.info(f"cfg file passed. Overriding default params with {kwargs['cfg']}.")
        # overrides = yaml_load(check_yaml(kwargs['cfg']))
        overrides = yaml_load(check_yaml(kwargs['cfg']), append_filename=False)  ################
    overrides['mode'] = 'train'
    if not overrides.get('data'):
        raise AttributeError("Dataset required but missing, i.e. pass 'data=coco128.yaml'")
    if overrides.get('resume'):
        overrides['resume'] = self.ckpt_path
    self.task = overrides.get('task') or self.task
    self.trainer = TASK_MAP[self.task][1](overrides=overrides, _callbacks=self.callbacks)
    # if not overrides.get('resume'):  # manually set model only if not resuming
    #     self.trainer.model = self.trainer.get_model(weights=self.model if self.ckpt else None, cfg=self.model.yaml)
    #     self.model = self.trainer.model
    ############################
    self.trainer.model = self.model
    #############################
    self.trainer.hub_session = self.session  # attach optional HUB session
    self.trainer.train()
    # Update model and cfg after training
    if RANK in (-1, 0):
        self.model, _ = attempt_load_one_weight(str(self.trainer.best))
        self.overrides = self.model.args
        self.metrics = getattr(self.trainer.validator, 'metrics', None)  # TODO: no metrics returned by DDP
2.2 After making the two changes above, just run the Python training script. For example, I simply rerun my earlier training script:
from ultralytics import YOLO

model = YOLO('ultralytics/models/v8/yolov8s.yaml')
results = model.train(data="ultralytics/datasets/mydata.yaml", epochs=200, device='0',
                      batch=8)  # train the model
The output looks like this:
Resuming training from runs/detect/exp/weights/last.pt from epoch 96 to 200 total epochs
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/exp
Starting training for 200 epochs...
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
96/200 5.39G 0.7552 0.4572 1.052 52 640: 100%|██████████| 605/605 [01:47<00:00, 5.61it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 76/76 [00:09<00:00, 7.78it/s]
all 1209 13079 0.893 0.899 0.954 0.793
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
97/200 4.29G 0.7519 0.4552 1.058 169 640: 12%|█▏ | 75/605 [00:11<01:24, 6.26it/s]^C
3. Changing the number of YOLOv8 training epochs
3.1 Reducing the number of epochs
In ultralytics/yolo/engine/trainer.py, find self.epochs = self.args.epochs and set self.epochs to the value you need. The following reduces an original 200-epoch run to 100 epochs.
self.batch_size = self.args.batch
# self.epochs = self.args.epochs
self.epochs = 100
self.start_epoch = 0
if RANK == -1:
    print_args(vars(self.args))

# Device
if self.device.type == 'cpu':
    self.args.workers = 0  # faster CPU training as time dominated by inference, not dataloading
Then follow the resume steps from section 2 above and run the training script. With self.epochs reduced to 100, the result is:
Resuming training from runs/detect/exp/weights/last.pt from epoch 97 to 100 total epochs
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/orange6000_small
Starting training for 100 epochs...
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
97/100 5.39G 0.7219 0.428 1.032 52 640: 100%|██████████| 605/605 [01:49<00:00, 5.52it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 76/76 [00:09<00:00, 7.64it/s]
all 1209 13079 0.893 0.897 0.952 0.792
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
98/100 5.13G 0.7404 0.44 1.041 75 640: 100%|██████████| 605/605 [01:50<00:00, 5.49it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 76/76 [00:09<00:00, 7.95it/s]
all 1209 13079 0.898 0.893 0.953 0.791
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
99/100 4.6G 0.728 0.4323 1.04 110 640: 100%|██████████| 605/605 [01:48<00:00, 5.57it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 76/76 [00:09<00:00, 7.74it/s]
all 1209 13079 0.898 0.894 0.953 0.793
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
100/100 6.1G 0.7311 0.4284 1.033 112 640: 100%|██████████| 605/605 [01:47<00:00, 5.61it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 76/76 [00:17<00:00, 4.40it/s]
all 1209 13079 0.898 0.894 0.954 0.793
4 epochs completed in 0.136 hours.
3.2 Increasing the number of epochs
To train for more epochs, set self.epochs to the value you need. For example, the following run resumed a 200-epoch training that needs to be extended to 300 epochs.
Resuming training from runs/detect/orange6000/weights/last.pt from epoch 4 to 200 total epochs
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/orange6000_small
Starting training for 200 epochs...
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
4/200 5.39G 2.45 2.368 2.909 52 640: 100%|██████████| 605/605 [02:55<00:00, 3.44it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 76/76 [00:17<00:00, 4.29it/s]
all 1209 13079 0.65 0.658 0.69 0.41
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
5/200 3.36G 1.691 1.489 1.941 163 640: 1%| | 7/605 [00:01<01:57, 5.09it/s]^C
Setting self.epochs = 300 gives the following result:
Resuming training from runs/detect/orange6000/weights/last.pt from epoch 5 to 300 total epochs
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/orange6000_small
Starting training for 300 epochs...
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
5/300 4.9G 3.447 4.066 4.216 112 640: 14%|█▍ | 86/605 [00:25<02:31, 3.43it/s]^C
Finally, the important thing, said three times:
After training finishes, restore the code in ultralytics/yolo/engine/trainer.py!!!
After training finishes, restore the code in ultralytics/yolo/engine/trainer.py!!!
After training finishes, restore the code in ultralytics/yolo/engine/trainer.py!!!
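Since it is easy to forget to revert these edits, a simple safeguard is to copy trainer.py aside before editing it and copy it back afterwards. A pure-stdlib sketch (the demo operates on a throwaway file, not your real trainer.py; the backup/restore helper names are my own):

```python
import os
import shutil
import tempfile

def backup(path, store):
    """Copy the file into the store directory before editing it."""
    dst = os.path.join(store, os.path.basename(path) + '.bak')
    shutil.copy2(path, dst)
    return dst

def restore(path, bak):
    """Put the original file contents back after training."""
    shutil.copy2(bak, path)

# Demo on a throwaway file standing in for trainer.py
store = tempfile.mkdtemp()
f = os.path.join(store, 'trainer.py')
with open(f, 'w') as fh:
    fh.write('original trainer code')

bak = backup(f, store)               # before applying the resume edits
with open(f, 'w') as fh:
    fh.write('resume hack')          # simulate the temporary edits from above
restore(f, bak)                      # after training: back to the original

print(open(f).read())                # → original trainer code
```

A version-controlled clone makes this even simpler: discarding the edits is just a checkout of the file.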