Previously we looked at how video data is loaded and processed. This time we analyze the corresponding parts of the code in mmaction2-master/mmaction/tools/train.py to see how video training works.
1. First, the main() function in tools/train.py calls train_model() to train the model (a condensed sketch of this call is given below).
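For reference, a hedged sketch of that call: the config path and keyword values below are illustrative, and the real main() additionally handles distributed setup, logging, and meta information, so treat this as an approximation rather than a verbatim copy of the script.
from mmcv import Config
from mmaction.apis import train_model
from mmaction.datasets import build_dataset
from mmaction.models import build_model

# example config path; any mmaction2 recognition config works the same way
cfg = Config.fromfile('configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py')
model = build_model(
    cfg.model, train_cfg=cfg.get('train_cfg'), test_cfg=cfg.get('test_cfg'))
datasets = [build_dataset(cfg.data.train)]
train_model(model, datasets, cfg, distributed=False, validate=True)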
2. Then, inside train_model():
The most important part of this function is building the runner and calling runner.run() to start the training loop. In single-GPU training the runner is an instance of EpochBasedRunner (OmniSourceRunner, a subclass of it, is used when omnisource is enabled).
Runner = OmniSourceRunner if cfg.omnisource else EpochBasedRunner  # use EpochBasedRunner
runner = Runner(  # wrap the model and optimizer in the runner
    model,
    optimizer=optimizer,
    work_dir=cfg.work_dir,
    logger=logger,
    meta=meta)
runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
The rest of train_model() handles runtime configuration and the setup of various training strategies; the config fields consumed by runner.run() above are illustrated below.
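An illustrative config fragment for those fields (the values are examples, not taken from any particular config file):
# illustrative config fields consumed by train_model() / runner.run()
work_dir = './work_dirs/example_experiment'  # where checkpoints and logs are written
total_epochs = 100                           # becomes self._max_epochs in the runner
workflow = [('train', 1)]                    # run one training epoch per pass through the workflow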
3. runner.run() is implemented in EpochBasedRunner (a subclass of BaseRunner).
The key training-related code of run() is shown below; it calls epoch_runner() to carry out the training.
class EpochBasedRunner(BaseRunner):
    def run(self,
            data_loaders: List[DataLoader],
            workflow: List[Tuple[str, int]],
            max_epochs: Optional[int] = None,
            **kwargs):
        work_dir = self.work_dir if self.work_dir is not None else 'NONE'
        # the training iteration loop
        while self.epoch < self._max_epochs:  # self.epoch increases by one per completed training epoch
            for i, flow in enumerate(workflow):  # workflow holds (mode, epochs) pairs describing the training flow
                mode, epochs = flow  # epochs is a fixed number of epochs to run in this mode
                if isinstance(mode, str):  # e.g. 'train' -> self.train()
                    if not hasattr(self, mode):
                        raise ValueError(
                            f'runner has no method named "{mode}" to run an '
                            'epoch')
                    epoch_runner = getattr(self, mode)
                else:
                    raise TypeError(
                        'mode in workflow must be a str, but got {}'.format(
                            type(mode)))
                for _ in range(epochs):  # run `epochs` epochs in this mode
                    if mode == 'train' and self.epoch >= self._max_epochs:
                        break
                    epoch_runner(data_loaders[i], **kwargs)  # entry point for one training epoch
        time.sleep(1)  # wait for some hooks like loggers to finish
        self.call_hook('after_run')
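The dispatch on mode is just a getattr() lookup. A self-contained toy (not mmcv code) that mimics the loop above:
class ToyRunner:
    """Minimal stand-in mimicking how EpochBasedRunner.run() dispatches workflow modes."""

    def train(self, loader):
        print('running a training epoch on', loader)

    def val(self, loader):
        print('running a validation epoch on', loader)

    def run(self, data_loaders, workflow):
        for i, (mode, epochs) in enumerate(workflow):
            epoch_runner = getattr(self, mode)  # 'train' -> self.train, 'val' -> self.val
            for _ in range(epochs):
                epoch_runner(data_loaders[i])

# [('train', 1), ('val', 1)] alternates one training epoch and one validation epoch
ToyRunner().run(['train_loader', 'val_loader'], [('train', 1), ('val', 1)])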
4. Calling epoch_runner() executes the train() method of EpochBasedRunner, which calls self.run_iter() to process one batch.
    def train(self, data_loader, **kwargs):
        self.model.train()
        self.mode = 'train'
        self.data_loader = data_loader
        self._max_iters = self._max_epochs * len(self.data_loader)
        self.call_hook('before_train_epoch')
        time.sleep(2)  # Prevent possible deadlock during epoch transition
        for i, data_batch in enumerate(self.data_loader):
            self.data_batch = data_batch
            self._inner_iter = i
            self.call_hook('before_train_iter')
            self.run_iter(data_batch, train_mode=True, **kwargs)  # run one training iteration
            self.call_hook('after_train_iter')
            del self.data_batch
            self._iter += 1
        self.call_hook('after_train_epoch')
        self._epoch += 1
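Each call_hook() above fans out to every hook registered on the runner. A hypothetical hook written against mmcv's Hook interface shows what reacts to the per-iteration hook points (the class name is made up for illustration; it would be registered on the runner, e.g. via runner.register_hook()):
from mmcv.runner import HOOKS, Hook

@HOOKS.register_module()
class PrintIterHook(Hook):  # hypothetical hook name, for illustration only
    """Logs the global iteration counter around each training iteration."""

    def before_train_iter(self, runner):
        print(f'starting iter {runner.iter}')

    def after_train_iter(self, runner):
        # runner.outputs is the dict stored by run_iter() (loss, log_vars, num_samples)
        print(f'finished iter {runner.iter}, log_vars: {runner.outputs["log_vars"]}')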
5. run_iter() is then executed; inside it, model.train_step() performs one training iteration.
    def run_iter(self, data_batch: Any, train_mode: bool, **kwargs):
        if self.batch_processor is not None:
            outputs = self.batch_processor(
                self.model, data_batch, train_mode=train_mode, **kwargs)
        elif train_mode:
            outputs = self.model.train_step(data_batch, self.optimizer,  # run the model's own training step
                                            **kwargs)
        else:
            outputs = self.model.val_step(data_batch, self.optimizer, **kwargs)
        if not isinstance(outputs, dict):
            raise TypeError('"batch_processor()" or "model.train_step()"'
                            'and "model.val_step()" must return a dict')
        if 'log_vars' in outputs:
            self.log_buffer.update(outputs['log_vars'], outputs['num_samples'])
        self.outputs = outputs
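run_iter() therefore expects a dict back from train_step()/val_step(). Its typical structure looks like the following (the values are illustrative; the actual keys are produced by BaseRecognizer.train_step() shown next, and the 'loss' entry is what mmcv's OptimizerHook later calls backward() on):
import torch

# illustrative structure of the dict that run_iter() stores in self.outputs
outputs = dict(
    loss=torch.tensor(1.23),                      # scalar the OptimizerHook backpropagates
    log_vars=dict(loss_cls=1.23, top1_acc=0.50),  # example values pushed into the log buffer
    num_samples=8)                                # batch size used to weight log averaging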
6. We then jump to train_step(), which is implemented in the recognizer base class recognizers/base.py and shared by all recognition models. The self() call runs the model's forward pass and computes the losses against the labels.
    def train_step(self, data_batch, optimizer, **kwargs):
        imgs = data_batch['imgs']
        label = data_batch['label']
        aux_info = {}
        for item in self.aux_info:
            assert item in data_batch
            aux_info[item] = data_batch[item]
        losses = self(imgs, label, return_loss=True, **aux_info)  # run the model's forward() to compute the losses
        loss, log_vars = self._parse_losses(losses)
        outputs = dict(
            loss=loss,
            log_vars=log_vars,
            num_samples=len(next(iter(data_batch.values()))))
        return outputs
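_parse_losses() turns the losses dict into a single scalar plus logging values. A simplified, self-contained sketch of its behaviour (the real implementation also reduces the values across GPUs in distributed training):
import torch

def parse_losses_sketch(losses):
    """Average each entry, sum every key containing 'loss' into the total,
    and return the rest (plus the total) as plain floats for logging."""
    log_vars = {name: value.mean() for name, value in losses.items()}
    loss = sum(value for key, value in log_vars.items() if 'loss' in key)
    log_vars['loss'] = loss
    return loss, {key: float(value) for key, value in log_vars.items()}

loss, log_vars = parse_losses_sketch(
    dict(loss_cls=torch.tensor(0.7), top1_acc=torch.tensor(1.0)))
print(loss, log_vars)  # total loss tensor plus a dict of floats for logging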
7. In forward(), the model chooses between the training and testing paths depending on the call.
    def forward(self, imgs, label=None, return_loss=True, **kwargs):
        """Define the computation performed at every call."""
        if kwargs.get('gradcam', False):
            del kwargs['gradcam']
            return self.forward_gradcam(imgs, **kwargs)
        if return_loss:
            if label is None:
                raise ValueError('Label should not be None.')
            if self.blending is not None:
                imgs, label = self.blending(imgs, label)
            return self.forward_train(imgs, label, **kwargs)  # call the concrete subclass's forward_train() to run training
        return self.forward_test(imgs, **kwargs)
8. forward_train() then dispatches to the training procedure of the concrete recognizer.
    def forward_train(self, imgs, labels, **kwargs):
        """Defines the computation performed at every call when training."""
        assert self.with_cls_head
        imgs = imgs.reshape((-1, ) + imgs.shape[2:])  # [8,1,3,32,224,224] -> [8,3,32,224,224]; input is bs*N*C*T*H*W, merge the bs and num_clips dims
        losses = dict()
        x = self.extract_feat(imgs)  # [8,3,32,224,224] -> [8,2048,4,7,7]; a 3D CNN backbone extracts features from the frames
        if self.with_neck:
            x, loss_aux = self.neck(x, labels.squeeze())
            losses.update(loss_aux)
        cls_score = self.cls_head(x)  # [bs, cls_num]
        gt_labels = labels.squeeze()
        loss_cls = self.cls_head.loss(cls_score, gt_labels, **kwargs)
        losses.update(loss_cls)
        return losses
During data processing the bs and num_clips dimensions are merged, so the classification scores end up with bs * num_clips rows while there are only bs labels. With num_clips > 1 this mismatch breaks the loss computation, which is why num_clips is restricted to 1 for training: only one clip is sampled per video, and that single clip has to represent the whole video.
When the input is a long video, however, a single clip can hardly represent the entire video, so video classification for long videos requires adjustments on top of this pipeline. The toy example below illustrates the shape mismatch.
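A toy tensor example of the mismatch described above (the shapes and class count are assumed for illustration; only the reshape mirrors forward_train()):
import torch

bs, num_clips, C, T, H, W = 8, 2, 3, 32, 224, 224   # assume 2 clips per video
imgs = torch.randn(bs, num_clips, C, T, H, W)
labels = torch.randint(0, 400, (bs, 1))

imgs = imgs.reshape((-1, ) + imgs.shape[2:])          # same reshape as forward_train()
print(imgs.shape)              # torch.Size([16, 3, 32, 224, 224]) -> bs * num_clips samples
print(labels.squeeze().shape)  # torch.Size([8])      -> only bs labels
# cls_head would produce 16 score rows against 8 labels, so the loss fails
# unless num_clips == 1 or the clip scores are merged back to one row per video.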