Reading the TSN Source Code

Contents

  • I. Project structure
    • 1. What each .py file does
    • 2. Function composition and call relationships
    • 3. IPO diagram
  • II. opts.py walkthrough
  • III. main.py walkthrough
    • 1. Overall architecture
    • 2. Detailed code walkthrough
  • IV. models.py walkthrough
    • 1. Overall architecture
    • 2. Detailed code walkthrough
  • V. dataset.py walkthrough
    • 1. Overall architecture
    • 2. Detailed code walkthrough
  • VI. test_models.py walkthrough
    • 1. Overall framework
    • 2. Detailed code walkthrough
  • VII. transforms.py walkthrough
  • VIII. basic_ops.py walkthrough
  • IX. utils.py walkthrough
  • Some rambling at the end

I. Project structure

1. What each .py file does

main.py: the training script
test_models.py: the testing script
opts.py: the argument configuration script
dataset.py: the data loading script
models.py: the network construction script
transforms.py: the data preprocessing script
tf_model_zoo: a folder of scripts for importing model architectures

2. Function composition and call relationships

[Figure 1: function composition and call graph]

3. IPO diagram

The figures below show how TSN classifies the frames extracted from the UCF-101 dataset, with the shape of every tensor annotated.
[Figure 2: IPO diagram, part 1]
[Figure 3: IPO diagram, part 2]

II. opts.py walkthrough

import argparse
parser = argparse.ArgumentParser(description="PyTorch implementation of Temporal Segment Networks")
parser.add_argument('dataset', type=str, choices=['ucf101', 'hmdb51', 'kinetics'])   # three supported datasets
parser.add_argument('modality', type=str, choices=['RGB', 'Flow', 'RGBDiff'])        # three input modalities
parser.add_argument('train_list', type=str)
parser.add_argument('val_list', type=str)

# ========================= Model Configs ==========================                 # model parameters
parser.add_argument('--arch', type=str, default="resnet101")                         # backbone model, default resnet101
parser.add_argument('--num_segments', type=int, default=3)                           # number of segments, default 3
parser.add_argument('--consensus_type', type=str, default='avg',                     # aggregation function, default avg; choices: avg, max, topk, identity, rnn, cnn
                    choices=['avg', 'max', 'topk', 'identity', 'rnn', 'cnn'])
parser.add_argument('--k', type=int, default=3)                                      # k for top-k consensus, default 3

parser.add_argument('--dropout', '--do', default=0.5, type=float,                    # dropout ratio, default 0.5
                    metavar='DO', help='dropout ratio (default: 0.5)')               # metavar is just the placeholder name shown in the help text
parser.add_argument('--loss_type', type=str, default="nll",
                    choices=['nll'])

# ========================= Learning Configs ==========================
parser.add_argument('--epochs', default=45, type=int, metavar='N',                   # number of epochs, default 45
                    help='number of total epochs to run')
parser.add_argument('-b', '--batch-size', default=256, type=int,                     # batch size, default 256
                    metavar='N', help='mini-batch size (default: 256)')
parser.add_argument('--lr', '--learning-rate', default=0.001, type=float,            # learning rate, default 0.001
                    metavar='LR', help='initial learning rate')
parser.add_argument('--lr_steps', default=[20, 40], type=float, nargs="+",           # epochs at which the learning rate is divided by 10
                    metavar='LRSteps', help='epochs to decay learning rate by 10')
parser.add_argument('--momentum', default=0.9, type=float, metavar='M',              # momentum, default 0.9
                    help='momentum')
parser.add_argument('--weight-decay', '--wd', default=5e-4, type=float,              # weight_decay, default 5e-4
                    metavar='W', help='weight decay (default: 5e-4)')
parser.add_argument('--clip-gradient', '--gd', default=None, type=float,             # clip_gradient, default None
                    metavar='W', help='gradient norm clipping (default: disabled)')
parser.add_argument('--no_partialbn', '--npb', default=False, action="store_true")   # no_partialbn disables the partial BN described in the paper; defaults to False

# ========================= Monitor Configs ==========================
parser.add_argument('--print-freq', '-p', default=20, type=int,                      # print frequency, default every 20 iterations
                    metavar='N', help='print frequency (default: 20)')
parser.add_argument('--eval-freq', '-ef', default=5, type=int,                       # evaluation frequency, default every 5 epochs
                    metavar='N', help='evaluation frequency (default: 5)')


# ========================= Runtime Configs ==========================               # options set when actually running; covered as they come up
parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                    help='number of data loading workers (default: 4)')
parser.add_argument('--resume', default='', type=str, metavar='PATH',
                    help='path to latest checkpoint (default: none)')
parser.add_argument('-e', '--evaluate', dest='evaluate', action='store_true',
                    help='evaluate model on validation set')
parser.add_argument('--snapshot_pref', type=str, default="")
parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                    help='manual epoch number (useful on restarts)')
parser.add_argument('--gpus', nargs='+', type=int, default=None)
parser.add_argument('--flow_prefix', default="", type=str)
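
To make the options concrete, here is a minimal parsing sketch (the two list-file paths are hypothetical placeholders):

from opts import parser

args = parser.parse_args(['ucf101', 'RGB',
                          'ucf101_train_list.txt', 'ucf101_val_list.txt',
                          '--arch', 'BNInception', '--num_segments', '3'])
print(args.arch, args.num_segments, args.lr)   # BNInception 3 0.001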

III. main.py walkthrough

1. Overall architecture

1. from opts import parser: parse the command-line arguments
2. Call models.py to initialize the TSN model
3. Call dataset.py to load the data
4. Train the model and save it

2. Detailed code walkthrough

import argparse
import os
import time
import shutil
import torch
import torchvision
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.optim
from torch.nn.utils import clip_grad_norm
from dataset import TSNDataSet
from models import TSN
from transforms import *
from opts import parser

"导入一些要用的包,其中比较重要的是 "
"导入模型:from models import TSN."
"导入配置的参数:from opts import parser."

# best top-1 accuracy so far
best_prec1 = 0


def main():
    # global variables
    global args, best_prec1
    # parse the command-line arguments; all options are defined in opts.py
    args = parser.parse_args()

    if args.dataset == 'ucf101':
        num_class = 101
    elif args.dataset == 'hmdb51':
        num_class = 51
    elif args.dataset == 'kinetics':
        num_class = 400
    else:
        raise ValueError('Unknown dataset '+args.dataset)
    # initialize the model
    model = TSN(num_class, args.num_segments, args.modality,base_model=args.arch,consensus_type=args.consensus_type, dropout=args.dropout, partial_bn=not args.no_partialbn)
    "TSN的定义在models.py脚本中"
    "num_class:分类的类别数"
    "args.num_segments:把一个video分成多少份,对应论文中的K,默认为K=3"
    "args.modality:采用哪种输入,比如RGB表示常规图像,Flow表示optical flow等"
    "args.arch:采用哪种模型,比如ResNet101,BNInception等"
    "rags.consensus_type:采用不同snippet融合方式,比如avg"
    "args.dropout:dropout参数"


    crop_size = model.crop_size
    scale_size = model.scale_size
    input_mean = model.input_mean
    input_std = model.input_std
    policies = model.get_optim_policies()
    train_augmentation = model.get_augmentation()

    # Multi-GPU setup (single machine, multiple GPUs) and checkpoint resuming.
    # torch.nn.DataParallel enables multi-GPU training.
    model = torch.nn.DataParallel(model, device_ids=args.gpus).cuda()

    # args.resume controls resuming from a checkpoint: if training stopped halfway, it can continue
    # from the latest saved epoch. It is either the default empty string or the path to a saved model file (.pth).
    if args.resume:
        if os.path.isfile(args.resume):
            print(("=> loading checkpoint '{}'".format(args.resume)))
            # checkpoint = torch.load(args.resume) loads the previously trained model
            checkpoint = torch.load(args.resume)
            args.start_epoch = checkpoint['epoch']
            best_prec1 = checkpoint['best_prec1']
            # model.load_state_dict initializes the network with the loaded parameters;
            # it is one of the key methods of torch.nn.Module.
            model.load_state_dict(checkpoint['state_dict'])
            print(("=> loaded checkpoint '{}' (epoch {})"
                  .format(args.resume, checkpoint['epoch'])))
        else:
            print(("=> no checkpoint found at '{}'".format(args.resume)))

    cudnn.benchmark = True

    # Data loading: import the data through the custom TSNDataSet class
    if args.modality != 'RGBDiff':
        normalize = GroupNormalize(input_mean, input_std)
    else:
        normalize = IdentityTransform()

    if args.modality == 'RGB':
        data_length = 1
    elif args.modality in ['Flow', 'RGBDiff']:
        data_length = 5
    train_loader = torch.utils.data.DataLoader(
        TSNDataSet("", args.train_list, num_segments=args.num_segments,
                   new_length=data_length,
                   modality=args.modality,
                   image_tmpl="img_{:05d}.jpg" if args.modality in ["RGB", "RGBDiff"] else args.flow_prefix+"{}_{:05d}.jpg",
                   transform=torchvision.transforms.Compose([
                       train_augmentation,
                       Stack(roll=args.arch == 'BNInception'),
                       ToTorchFormatTensor(div=args.arch != 'BNInception'),
                       normalize,
                   ])),
        batch_size=args.batch_size, shuffle=True,
        num_workers=args.workers, pin_memory=True)

    val_loader = torch.utils.data.DataLoader(
        TSNDataSet("", args.val_list, num_segments=args.num_segments,
                   new_length=data_length,
                   modality=args.modality,
                   image_tmpl="img_{:05d}.jpg" if args.modality in ["RGB", "RGBDiff"] else args.flow_prefix+"{}_{:05d}.jpg",
                   random_shift=False,
                   transform=torchvision.transforms.Compose([
                       GroupScale(int(scale_size)),
                       GroupCenterCrop(crop_size),
                       Stack(roll=args.arch == 'BNInception'),
                       ToTorchFormatTensor(div=args.arch != 'BNInception'),
                       normalize,
                   ])),
        batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=True)

    # define loss function (criterion) and optimizer
    # Define the loss function, the optimizer and a few hyper-parameters:
    # the loss is cross-entropy and the optimizer is SGD.
    if args.loss_type == 'nll':
        criterion = torch.nn.CrossEntropyLoss().cuda()
    else:
        raise ValueError("Unknown loss type")

    for group in policies:
        print(('group: {} has {} params, lr_mult: {}, decay_mult: {}'.format(
            group['name'], len(group['params']), group['lr_mult'], group['decay_mult'])))

    optimizer = torch.optim.SGD(policies,
                                args.lr,
                                momentum=args.momentum,
                                weight_decay=args.weight_decay)

    # args.evaluate decides between evaluation mode and training mode
    if args.evaluate:
        # evaluation mode: just run validate on the model and return
        validate(val_loader, model, criterion, 0)
        return
    # otherwise: training mode
    for epoch in range(args.start_epoch, args.epochs):
        # adjust the learning rate
        adjust_learning_rate(optimizer, epoch, args.lr_steps)

        # train for one epoch
        # call train to train the model for one epoch
        train(train_loader, model, criterion, optimizer, epoch)

        # evaluate on validation set, then save the model:
        # once the epoch count reaches a multiple of args.eval_freq (or the last epoch), validate and save
        if (epoch + 1) % args.eval_freq == 0 or epoch == args.epochs - 1:
            prec1 = validate(val_loader, model, criterion, (epoch + 1) * len(train_loader))

            # remember best prec@1 and save checkpoint
            # keep track of the model with the best validation accuracy
            is_best = prec1 > best_prec1
            best_prec1 = max(prec1, best_prec1)
            # call save_checkpoint to save the parameters and bookkeeping info
            save_checkpoint({
                'epoch': epoch + 1,
                'arch': args.arch,
                'state_dict': model.state_dict(),
                'best_prec1': best_prec1,
            }, is_best)

# train() wraps one epoch of TSN training into a function that main() can call directly
def train(train_loader, model, criterion, optimizer, epoch):
    batch_time = AverageMeter()
    data_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()

    # choose whether to use partial BN
    if args.no_partialbn:
        model.module.partialBN(False)
    else:
        model.module.partialBN(True)

    # model.train() invokes the train method overridden in models.py: it puts the model in training
    # mode and freezes the means and variances of every BatchNorm layer except the first,
    # so that fine-tuning updates the remaining parameters such as the fully connected layer.
    model.train()

    # The data-loading loop: enumerate(train_loader) first calls DataLoader's __iter__ method,
    # which in turn runs the __init__ of the DataLoaderIter class.
    end = time.time()
    for i, (input, target) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        # iterate over the training data, keeping the inputs and ground-truth labels
        target = target.cuda()
        input_var = torch.autograd.Variable(input)
        target_var = torch.autograd.Variable(target)

        # output = model(input_var) runs the forward pass
        output = model(input_var)
        loss = criterion(output, target_var)

        # after computing the loss, call accuracy to update the top-1 and top-5 accuracies
        prec1, prec5 = accuracy(output.data, target, topk=(1,5))
        losses.update(loss.data[0], input.size(0))
        top1.update(prec1[0], input.size(0))
        top5.update(prec5[0], input.size(0))


        # Gradient zeroing, backpropagation and the update step follow the usual recipe;
        # at the end we have the updated model plus the loss and accuracy statistics.
        optimizer.zero_grad()

        loss.backward()

        if args.clip_gradient is not None:
            total_norm = clip_grad_norm(model.parameters(), args.clip_gradient)
            if total_norm > args.clip_gradient:
                print("clipping gradient: {} with coef {}".format(total_norm, args.clip_gradient / total_norm))

        optimizer.step()

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % args.print_freq == 0:
            print(('Epoch: [{0}][{1}/{2}], lr: {lr:.5f}\t'
                  'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'Data {data_time.val:.3f} ({data_time.avg:.3f})\t'
                  'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                  'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'
                  'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format(
                   epoch, i, len(train_loader), batch_time=batch_time,
                   data_time=data_time, loss=losses, top1=top1, top5=top5, lr=optimizer.param_groups[-1]['lr'])))


# validate mirrors train, with two differences:
# model.eval() switches the model to evaluation mode, and
# there is no optimizer.zero_grad(), loss.backward() or optimizer.step(), i.e. no backprop or parameter update
def validate(val_loader, model, criterion, iter, logger=None):
    batch_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()

    # switch to evaluate mode
    model.eval()

    end = time.time()
    for i, (input, target) in enumerate(val_loader):
        target = target.cuda()
        input_var = torch.autograd.Variable(input, volatile=True)
        target_var = torch.autograd.Variable(target, volatile=True)

        # compute output
        output = model(input_var)
        loss = criterion(output, target_var)

        # measure accuracy and record loss
        prec1, prec5 = accuracy(output.data, target, topk=(1,5))

        losses.update(loss.data[0], input.size(0))
        top1.update(prec1[0], input.size(0))
        top5.update(prec5[0], input.size(0))

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % args.print_freq == 0:
            print(('Test: [{0}/{1}]\t'
                  'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                  'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'
                  'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format(
                   i, len(val_loader), batch_time=batch_time, loss=losses,
                   top1=top1, top5=top5)))

    print(('Testing Results: Prec@1 {top1.avg:.3f} Prec@5 {top5.avg:.3f} Loss {loss.avg:.5f}'
          .format(top1=top1, top5=top5, loss=losses)))

    return top1.avg

# Saves the best-performing model and its parameters: it builds the file path and saves the model with torch.save().
def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
    filename = '_'.join((args.snapshot_pref, args.modality.lower(), filename))
    torch.save(state, filename)
    if is_best:
        best_name = '_'.join((args.snapshot_pref, args.modality.lower(), 'model_best.pth.tar'))
        shutil.copyfile(filename, best_name)

# AverageMeter manages running statistics such as the loss and the top-1 accuracy.
# reset() is called on construction; calling update() refreshes the statistics;
# values are read as attributes of the object,
# e.g. top1.val in the train function is the current top-1 accuracy.
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count
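
A small usage sketch of AverageMeter (the loss values are made up): it keeps both the latest value and the running average.

meter = AverageMeter()
meter.update(0.9, n=32)       # a batch of 32 samples with mean loss 0.9
meter.update(0.7, n=32)       # the next batch
print(meter.val, meter.avg)   # 0.7 0.8  (latest value, running average)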

# learning-rate adjustment
def adjust_learning_rate(optimizer, epoch, lr_steps):
    """Sets the learning rate to the initial LR decayed by 10 every 30 epochs"""
    # ls_steps是一个列表,里面的值表示到达多少个epoch的时候要改变学习率,
    # 在adjust_learning_rate函数中,修改学习率时是默认修改成当前的0.1倍
    decay = 0.1 ** (sum(epoch >= np.array(lr_steps)))
    lr = args.lr * decay
    decay = args.weight_decay
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr * param_group['lr_mult']
        param_group['weight_decay'] = decay * param_group['decay_mult']
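
A worked example of the decay schedule, assuming the defaults args.lr = 0.001 and lr_steps = [20, 40]:

import numpy as np

lr_steps = [20, 40]
for epoch in [10, 25, 41]:
    decay = 0.1 ** (sum(epoch >= np.array(lr_steps)))
    print(epoch, 0.001 * decay)   # 10 -> 0.001, 25 -> 0.0001, 41 -> 1e-05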

# Computes accuracy: output is the model prediction (batch_size x num_class), target is the ground-truth label (batch_size)
def accuracy(output, target, topk=(1,)):
    """Computes the precision@k for the specified values of k"""
    maxk = max(topk)
    # read the batch size
    batch_size = target.size(0)

    # Tensor.topk: maxk is how many top predictions to take;
    # 1 is the dimension (per row, dim=1); largest=True returns the maxk largest values;
    # sorted=True returns them sorted
    _, pred = output.topk(maxk, 1, True, True)
    pred = pred.t()
    # target.view(1, -1).expand_as(pred) first reshapes target to 1 x batch_size and then expands it to pred's shape, i.e. maxk x batch_size
    # (e.g. 5 x batch_size); eq() then compares the two tensors element-wise, giving a ByteTensor of the same shape where 1 means equal and 0 means different
    correct = pred.eq(target.view(1, -1).expand_as(pred))

    res = []
    for k in topk:
        # correct[:k].view(-1).float().sum(0) counts the top-k hits (k selects which accuracy is computed,
        # sum(0) sums over dim 0); each result is appended to res and returned
        correct_k = correct[:k].view(-1).float().sum(0)
        res.append(correct_k.mul_(100.0 / batch_size))
    return res
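
A tiny sanity check of accuracy() on made-up tensors (old-style constructors to match this code base; the returned values are tensors):

output = torch.FloatTensor([[0.1, 0.6, 0.2, 0.1],    # top-1: class 1, top-2: {1, 2}
                            [0.5, 0.3, 0.1, 0.1],    # top-1: class 0, top-2: {0, 1}
                            [0.1, 0.2, 0.3, 0.4]])   # top-1: class 3, top-2: {3, 2}
target = torch.LongTensor([1, 1, 2])
print(accuracy(output, target, topk=(1, 2)))   # top-1 = 33.33%, top-2 = 100%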


if __name__ == '__main__':
    main()

IV. models.py walkthrough

1. Overall architecture

1. __init__: initialize the model and set its parameters
2. _prepare_base_model: choose the backbone network and set up its data preprocessing
3. _prepare_tsn: after _prepare_base_model is done, modify the final fully connected layer as needed
4. train: override train() to freeze every BN layer except the first
5. get_optim_policies: walk the model's layers and collect their parameters for the optimizer
6. forward: forward pass
7. get_augmentation: pick the data augmentation according to the input modality
8. _construct_flow_model and _construct_diff_model define the inputs for optical flow and RGB Diff

2. Detailed code walkthrough

from torch import nn
from ops.basic_ops import ConsensusModule, Identity
from transforms import *
from torch.nn.init import normal, constant

# Model initialization.
# models.py prepares the model that will later be trained:
# it starts from a classic backbone such as resnet101 or BNInception and,
# depending on the input modality, modifies the final fully connected layer to obtain the TSN model we need.

class TSN(nn.Module):

    # __init__ builds the TSN model: it sets the parameters and their defaults and calls helpers that modify the network

    # num_class: number of output classes
    # num_segments: number of segments the video is divided into
    # modality: input modality (RGB, optical flow, RGB diff)
    # base_model: backbone that the TSN model is built from; default resnet101
    # new_length: number of consecutive frames per snippet (1 for RGB, 5 for optical flow); default None, filled in below
    # consensus_type: aggregation function, default avg (average pooling)
    # before_softmax: whether to fuse before the softmax, default True
    # dropout: dropout ratio, default 0.8
    # crop_num: number of crops taken from each sample
    # partial_bn: whether to use partial BN, default True

    def __init__(self, num_class, num_segments, modality,
                 base_model='resnet101', new_length=None,
                 consensus_type='avg', before_softmax=True,
                 dropout=0.8,
                 crop_num=1, partial_bn=True):
        super(TSN, self).__init__()
        self.modality = modality
        self.num_segments = num_segments
        self.reshape = True
        self.before_softmax = before_softmax
        self.dropout = dropout
        self.crop_num = crop_num
        self.consensus_type = consensus_type
        if not before_softmax and consensus_type != 'avg':
            raise ValueError("Only avg consensus can be used after Softmax")

        # when new_length is None, the author uses new_length=1 for RGB and 5 for optical flow and RGB Diff
        if new_length is None:
            self.new_length = 1 if modality == "RGB" else 5
        else:
            self.new_length = new_length
        "new_length和输入数据类型相关"

        print(("""
Initializing TSN with base model: {}.
TSN Configurations:
    input_modality:     {}
    num_segments:       {}
    new_length:         {}
    consensus_module:   {}
    dropout_ratio:      {}
        """.format(base_model, self.modality, self.num_segments, self.new_length, consensus_type, self.dropout)))

        # import the backbone by calling TSN's _prepare_base_model method
        self._prepare_base_model(base_model)

        # call TSN's _prepare_tsn method to get the feature dimension and further adapt the network structure
        feature_dim = self._prepare_tsn(num_class)

        # These two modalities differ mainly in the first conv layer, whose input channels depend on the input type
        if self.modality == 'Flow':           # if the input is optical flow, call _construct_flow_model to adapt the input layer
            print("Converting the ImageNet model to a flow init model")
            self.base_model = self._construct_flow_model(self.base_model)
            print("Done. Flow model ready...")
        elif self.modality == 'RGBDiff':      # if the input is RGB difference, call _construct_diff_model to adapt the input layer
            print("Converting the ImageNet model to RGB+Diff init model")
            self.base_model = self._construct_diff_model(self.base_model)
            print("Done. RGBDiff model ready.")

        # ConsensusModule lives in basic_ops.py
        self.consensus = ConsensusModule(consensus_type)

        # before_softmax = False: nn.Softmax() is applied later in the network (see forward for details)
        # before_softmax = True: nn.Softmax() is not applied
        if not self.before_softmax:
            self.softmax = nn.Softmax()

        # both of these refer to partial BN
        self._enable_pbn = partial_bn
        if partial_bn:
            self.partialBN(True)

    # set up preprocessing for the different backbones: resnet101, BNInception, inception
    def _prepare_base_model(self, base_model):

        # The key call is getattr(torchvision.models, base_model)(),
        # which imports a different network depending on the value of base_model.
        # Each backbone gets its own input size, mean and std, used later for data preprocessing;
        # optical flow and RGB difference additionally need their own settings.
        if 'resnet' in base_model or 'vgg' in base_model:  # if base_model is a resnet or vgg variant
            self.base_model = getattr(torchvision.models, base_model)(True)
            self.base_model.last_layer_name = 'fc'   # record the name of the backbone's 'fc' layer
            self.input_size = 224           # resnet and vgg networks take 224x224 inputs

            # Normalization: each input channel has the matching input_mean value subtracted
            # and is then divided by the matching input_std value
            self.input_mean = [0.485, 0.456, 0.406]
            self.input_std = [0.229, 0.224, 0.225]
            # for Flow, input_mean is 0.5 and input_std is the mean of the RGB input_std
            if self.modality == 'Flow':
                self.input_mean = [0.5]
                self.input_std = [np.mean(self.input_std)]
            elif self.modality == 'RGBDiff':
                self.input_mean = [0.485, 0.456, 0.406] + [0] * 3 * self.new_length
                self.input_std = self.input_std + [np.mean(self.input_std) * 2] * 3 * self.new_length

        elif base_model == 'BNInception':
            import tf_model_zoo
            self.base_model = getattr(tf_model_zoo, base_model)()
            self.base_model.last_layer_name = 'fc'
            self.input_size = 224
            self.input_mean = [104, 117, 128]
            self.input_std = [1]

            if self.modality == 'Flow':
                self.input_mean = [128]
            elif self.modality == 'RGBDiff':
                self.input_mean = self.input_mean * (1 + self.new_length)

        elif 'inception' in base_model:
            import tf_model_zoo
            self.base_model = getattr(tf_model_zoo, base_model)()
            self.base_model.last_layer_name = 'classif'
            self.input_size = 299
            self.input_mean = [0.5]
            self.input_std = [0.5]
        else:
            raise ValueError('Unknown base model: {}'.format(base_model))


    # _prepare_tsn modifies the final fully connected layer of the given base_model,
    # fine-tuning the head so that its output fits the dataset.
    def _prepare_tsn(self, num_class):
        # Get the number of input channels of the network's last layer and store it in feature_dim.
        # getattr(base_model, base_model.last_layer_name)
        # returns e.g. Linear(in_features=2048, out_features=1000, bias=True);
        # getattr(base_model, base_model.last_layer_name).in_features then gives 2048
        feature_dim = getattr(self.base_model, self.base_model.last_layer_name).in_features
        # If dropout is enabled, add a dropout layer followed by a new fully connected layer; otherwise attach the fully connected layer directly
        if self.dropout == 0:
            # after this setattr, the layer named self.base_model.last_layer_name is nn.Linear(feature_dim, num_class)
            setattr(self.base_model, self.base_model.last_layer_name, nn.Linear(feature_dim, num_class))
            self.new_fc = None
        else:
            # after this setattr, the layer named self.base_model.last_layer_name is nn.Dropout(p=self.dropout)
            setattr(self.base_model, self.base_model.last_layer_name, nn.Dropout(p=self.dropout))
            # create a new fc layer for forward() to call; this layer is not part of base_model
            self.new_fc = nn.Linear(feature_dim, num_class)

        # initialize the fc weights with zero mean and std=0.001, and the bias with 0
        std = 0.001
        if self.new_fc is None:
            normal(getattr(self.base_model, self.base_model.last_layer_name).weight, 0, std)
            constant(getattr(self.base_model, self.base_model.last_layer_name).bias, 0)
        else:
            normal(self.new_fc.weight, 0, std)
            constant(self.new_fc.bias, 0)
        return feature_dim


    # override train() to freeze every BN layer except the first
    def train(self, mode=True):
        """
        Override the default train() to freeze the BN parameters
        :return:
        """
        super(TSN, self).train(mode)
        count = 0
        if self._enable_pbn:
            print("Freezing BatchNorm2D except the first one.")
            for m in self.base_model.modules():
                if isinstance(m, nn.BatchNorm2d):
                    count += 1
                    if count >= (2 if self._enable_pbn else 1):
                        m.eval()

                        # shutdown update in frozen mode
                        m.weight.requires_grad = False
                        m.bias.requires_grad = False

    def partialBN(self, enable):
        self._enable_pbn = enable

    # walk the model's layers and collect their parameters for the optimizer
    def get_optim_policies(self):
        first_conv_weight = []
        first_conv_bias = []
        normal_weight = []
        normal_bias = []
        bn = []

        conv_cnt = 0
        bn_cnt = 0
        for m in self.modules():
            if isinstance(m, torch.nn.Conv2d) or isinstance(m, torch.nn.Conv1d):
                ps = list(m.parameters())
                conv_cnt += 1
                if conv_cnt == 1:
                    first_conv_weight.append(ps[0])
                    if len(ps) == 2:
                        first_conv_bias.append(ps[1])
                else:
                    normal_weight.append(ps[0])
                    if len(ps) == 2:
                        normal_bias.append(ps[1])
            elif isinstance(m, torch.nn.Linear):
                ps = list(m.parameters())
                normal_weight.append(ps[0])
                if len(ps) == 2:
                    normal_bias.append(ps[1])
                  
            elif isinstance(m, torch.nn.BatchNorm1d):
                bn.extend(list(m.parameters()))
            elif isinstance(m, torch.nn.BatchNorm2d):
                bn_cnt += 1
                # later BN's are frozen
                if not self._enable_pbn or bn_cnt == 1:
                    bn.extend(list(m.parameters()))
            elif len(m._modules) == 0:
                if len(list(m.parameters())) > 0:
                    raise ValueError("New atomic module type: {}. Need to give it a learning policy".format(type(m)))

        return [
            {'params': first_conv_weight, 'lr_mult': 5 if self.modality == 'Flow' else 1, 'decay_mult': 1,
             'name': "first_conv_weight"},
            {'params': first_conv_bias, 'lr_mult': 10 if self.modality == 'Flow' else 2, 'decay_mult': 0,
             'name': "first_conv_bias"},
            {'params': normal_weight, 'lr_mult': 1, 'decay_mult': 1,
             'name': "normal_weight"},
            {'params': normal_bias, 'lr_mult': 2, 'decay_mult': 0,
             'name': "normal_bias"},
            {'params': bn, 'lr_mult': 1, 'decay_mult': 0,
             'name': "BN scale/shift"},
        ]

    # forward pass
    def forward(self, input):
        sample_len = (3 if self.modality == "RGB" else 2) * self.new_length

        if self.modality == 'RGBDiff':
            sample_len = 3 * self.new_length
            input = self._get_diff(input)

        base_out = self.base_model(input.view((-1, sample_len) + input.size()[-2:]))

        if self.dropout > 0:
            base_out = self.new_fc(base_out)

        if not self.before_softmax:
            base_out = self.softmax(base_out)
        if self.reshape:
            base_out = base_out.view((-1, self.num_segments) + base_out.size()[1:])

        output = self.consensus(base_out)
        return output.squeeze(1)
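
My own shape walkthrough of forward (RGB input, num_segments=3, batch size N; a sketch, not part of the source):

# input:     (N, 3*3, 224, 224)  -- Stack() concatenated the 3 segments along the channel axis
# view:      (N*3, 3, 224, 224)  -- sample_len = 3 * new_length = 3 for RGB
# base_out:  (N*3, num_class)    -- per-snippet class scores
# reshape:   (N, 3, num_class)   -- regroup the scores by video
# consensus: (N, 1, num_class)   -- average over the segment dimension
# squeeze:   (N, num_class)      -- final video-level scores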


    def _get_diff(self, input, keep_rgb=False):
        input_c = 3 if self.modality in ["RGB", "RGBDiff"] else 2
        input_view = input.view((-1, self.num_segments, self.new_length + 1, input_c,) + input.size()[2:])
        if keep_rgb:
            new_data = input_view.clone()
        else:
            new_data = input_view[:, :, 1:, :, :, :].clone()

        for x in reversed(list(range(1, self.new_length + 1))):
            if keep_rgb:
                new_data[:, :, x, :, :, :] = input_view[:, :, x, :, :, :] - input_view[:, :, x - 1, :, :, :]
            else:
                new_data[:, :, x - 1, :, :, :] = input_view[:, :, x, :, :, :] - input_view[:, :, x - 1, :, :, :]

        return new_data
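
And a matching sketch of the _get_diff shapes (RGBDiff, num_segments=3; in models.py new_length is 5, so each snippet carries new_length+1 = 6 frames):

# input_view: (N, 3, 6, 3, H, W)  -- (batch, segments, frames per snippet, channels, H, W)
# new_data:   (N, 3, 5, 3, H, W)  -- frame t minus frame t-1 for t = 1..5 (keep_rgb=False)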


    def _construct_flow_model(self, base_model):
        # modify the convolution layers
        # Torch models are usually defined in a hierarchical way.
        # nn.modules.children() return all sub modules in a DFS manner
        modules = list(self.base_model.modules())
        first_conv_idx = list(filter(lambda x: isinstance(modules[x], nn.Conv2d), list(range(len(modules)))))[0]
        conv_layer = modules[first_conv_idx]
        container = modules[first_conv_idx - 1]

        # modify parameters, assume the first blob contains the convolution kernels
        params = [x.clone() for x in conv_layer.parameters()]
        kernel_size = params[0].size()
        new_kernel_size = kernel_size[:1] + (2 * self.new_length, ) + kernel_size[2:]
        new_kernels = params[0].data.mean(dim=1, keepdim=True).expand(new_kernel_size).contiguous()

        new_conv = nn.Conv2d(2 * self.new_length, conv_layer.out_channels,
                             conv_layer.kernel_size, conv_layer.stride, conv_layer.padding,
                             bias=True if len(params) == 2 else False)
        new_conv.weight.data = new_kernels
        if len(params) == 2:
            new_conv.bias.data = params[1].data # add bias if neccessary
        layer_name = list(container.state_dict().keys())[0][:-7] # remove .weight suffix to get the layer name

        # replace the first convlution layer
        setattr(container, layer_name, new_conv)
        return base_model

    def _construct_diff_model(self, base_model, keep_rgb=False):
        # modify the convolution layers
        # Torch models are usually defined in a hierarchical way.
        # nn.modules.children() return all sub modules in a DFS manner
        modules = list(self.base_model.modules())
        first_conv_idx = list(filter(lambda x: isinstance(modules[x], nn.Conv2d), list(range(len(modules)))))[0]
        conv_layer = modules[first_conv_idx]
        container = modules[first_conv_idx - 1]

        # modify parameters, assume the first blob contains the convolution kernels
        params = [x.clone() for x in conv_layer.parameters()]
        kernel_size = params[0].size()
        if not keep_rgb:
            new_kernel_size = kernel_size[:1] + (3 * self.new_length,) + kernel_size[2:]
            new_kernels = params[0].data.mean(dim=1, keepdim=True).expand(new_kernel_size).contiguous()
        else:
            new_kernel_size = kernel_size[:1] + (3 * self.new_length,) + kernel_size[2:]
            new_kernels = torch.cat((params[0].data, params[0].data.mean(dim=1, keepdim=True).expand(new_kernel_size).contiguous()),
                                    1)
            new_kernel_size = kernel_size[:1] + (3 + 3 * self.new_length,) + kernel_size[2:]

        new_conv = nn.Conv2d(new_kernel_size[1], conv_layer.out_channels,
                             conv_layer.kernel_size, conv_layer.stride, conv_layer.padding,
                             bias=True if len(params) == 2 else False)
        new_conv.weight.data = new_kernels
        if len(params) == 2:
            new_conv.bias.data = params[1].data  # add bias if neccessary
        layer_name = list(container.state_dict().keys())[0][:-7]  # remove .weight suffix to get the layer name

        # replace the first convolution layer
        setattr(container, layer_name, new_conv)
        return base_model

    @property
    def crop_size(self):
        return self.input_size

    @property
    def scale_size(self):
        return self.input_size * 256 // 224

    # pick the data preprocessing according to the input modality
    def get_augmentation(self):
        if self.modality == 'RGB':
            return torchvision.transforms.Compose([GroupMultiScaleCrop(self.input_size, [1, .875, .75, .66]),
                                                   GroupRandomHorizontalFlip(is_flow=False)])
        elif self.modality == 'Flow':
            return torchvision.transforms.Compose([GroupMultiScaleCrop(self.input_size, [1, .875, .75]),
                                                   GroupRandomHorizontalFlip(is_flow=True)])
        elif self.modality == 'RGBDiff':
            return torchvision.transforms.Compose([GroupMultiScaleCrop(self.input_size, [1, .875, .75]),
                                                   GroupRandomHorizontalFlip(is_flow=False)])

V. dataset.py walkthrough

1. Overall architecture

1. class VideoRecord:
wraps the information of one video (frame directory, number of frames, label)
2. class TSNDataSet:
(1) __init__: initialization, set the parameters
(2) _load_image: load an image given its path
(3) _parse_list: read list_file, wrap each video's name, frame count and label into a VideoRecord, and store them in video_list
(4) _sample_indices: TSN's sparse sampling; returns the list of sampled frame indices
(5) _get_val_indices: get the sampled frame list for validation
(6) _get_test_indices: get the sampled frame list for testing
(7) __getitem__: run the sparse sampling (_sample_indices) and call get() to produce what TSNDataSet returns
(8) get: load the sampled frames and apply the transforms (corner cropping, center cropping, etc.)
(9) __len__: return the dataset length

2. Detailed code walkthrough

import torch.utils.data as data
from PIL import Image
import os
import os.path
import numpy as np
from numpy.random import randint

# dataset.py reads the dataset, sparsely samples it, and returns the sampled data

# wraps the information of one video (frame directory, number of frames, label)
class VideoRecord(object):
    def __init__(self, row):
        self._data = row

    @property
    def path(self):
        return self._data[0]

    @property
    def num_frames(self):
        return int(self._data[1])

    @property
    def label(self):
        return int(self._data[2])

# TSNDataSet handles the raw data. It subclasses PyTorch's native Dataset class,
# so instances are of type torch.utils.data.Dataset.
# Note: custom data readers in PyTorch generally subclass torch.utils.data.Dataset
# and override the __init__ and __getitem__ methods.

class TSNDataSet(data.Dataset):
    # initialization, set the parameters
    def __init__(self, root_path, list_file,
                 num_segments=3, new_length=1, modality='RGB',
                 image_tmpl='img_{:05d}.jpg', transform=None,
                 force_grayscale=False, random_shift=True, test_mode=False):

        # root_path: project root directory
        # list_file: path of the training/testing list file (.txt)
        # num_segments: number of segments the video is divided into
        # new_length: number of consecutive frames read from each sampled position
        # modality: input modality (RGB, optical flow, RGB diff)
        # image_tmpl: frame file-name template
        # transform: data transforms
        # random_shift: whether to add a random offset during sparse sampling
        # test_mode: whether this is test mode

        self.root_path = root_path
        self.list_file = list_file
        self.num_segments = num_segments
        self.new_length = new_length
        self.modality = modality
        self.image_tmpl = image_tmpl
        self.transform = transform
        self.random_shift = random_shift
        self.test_mode = test_mode

        if self.modality == 'RGBDiff':
            self.new_length += 1 # Diff needs one more image to calculate diff

        self._parse_list()

    # load an image given its path
    def _load_image(self, directory, idx):
        if self.modality == 'RGB' or self.modality == 'RGBDiff':
            return [Image.open(os.path.join(directory, self.image_tmpl.format(idx))).convert('RGB')]
        elif self.modality == 'Flow':
            x_img = Image.open(os.path.join(directory, self.image_tmpl.format('x', idx))).convert('L')
            y_img = Image.open(os.path.join(directory, self.image_tmpl.format('y', idx))).convert('L')

            return [x_img, y_img]

    # Read list_file and wrap each video's name, frame count and label into a VideoRecord stored in video_list.
    # self.list_file is the training or testing list file (.txt) with three space-separated columns:
    # the video name, the video's frame count, and the video's label;
    # these three fields are extracted and wrapped into VideoRecord objects stored in video_list.
    def _parse_list(self):
        self.video_list = [VideoRecord(x.strip().split(' ')) for x in open(self.list_file)]

    # TSN's sparse sampling: returns the list of sampled frame indices
    def _sample_indices(self, record):
        """
        :param record: VideoRecord
        :return: list
        """
        # Suppose a video has 150 frames, num_segments=3 and the modality is RGB. Sparse sampling works as follows:
        # split the video into num_segments=3 segments; with record.num_frames=150 and self.new_length=1,
        # the average segment length is 50 frames, i.e. average_duration = 50.
        # offsets is then built with one entry per segment:
        # for the first segment, range(self.num_segments) contributes 0, so if
        # randint(average_duration, size=self.num_segments) draws 10 there, the sampled frame index is 10;
        # likewise, draws of 12 and 15 for the second and third segments give 50+12=62 and 100+15=115.
        # So offsets = [10, 62, 115], and the function returns offsets + 1, i.e. the frames actually taken are [11, 63, 116].
        average_duration = (record.num_frames - self.new_length + 1) // self.num_segments
        if average_duration > 0:
            offsets = np.multiply(list(range(self.num_segments)), average_duration) + randint(average_duration, size=self.num_segments)
        elif record.num_frames > self.num_segments:
            offsets = np.sort(randint(record.num_frames - self.new_length + 1, size=self.num_segments))
        else:
            offsets = np.zeros((self.num_segments,))
        return offsets + 1
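
The commented example can be reproduced directly by fixing the random draws to 10, 12 and 15:

import numpy as np

num_segments, new_length, num_frames = 3, 1, 150
average_duration = (num_frames - new_length + 1) // num_segments   # 50
draws = np.array([10, 12, 15])   # stand-in for randint(average_duration, size=num_segments)
offsets = np.multiply(list(range(num_segments)), average_duration) + draws
print(offsets + 1)   # [ 11  63 116]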

    # get the sampled frame list for validation; called when the model validates internally
    def _get_val_indices(self, record):
        if record.num_frames > self.num_segments + self.new_length - 1:
            tick = (record.num_frames - self.new_length + 1) / float(self.num_segments)
            offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments)])
        else:
            offsets = np.zeros((self.num_segments,))
        return offsets + 1

    # Get the sampled frame list for testing; called for external testing.
    # The input video is split into self.num_segments spans of equal frame distance;
    # the returned offsets is a numpy array of length self.num_segments
    # giving which frames of the input video are fed to the model.
    def _get_test_indices(self, record):

        tick = (record.num_frames - self.new_length + 1) / float(self.num_segments)

        offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments)])

        return offsets + 1

    # Runs the sparse sampling (_sample_indices) and calls get() to produce what TSNDataSet returns.
    # This method runs after TSNDataSet has been initialized.
    # record reads the index-th entry of video_list: the video's frame directory, frame count and class.
    # During training self.test_mode is False, so the if branch runs; self.random_shift defaults to True,
    # so the sampling function actually executed is _sample_indices(record).
    # During testing self.test_mode is True, so _get_test_indices runs instead.
    # The sampled frame list is stored in segment_indices and then passed to get() as an argument.
    def __getitem__(self, index):
        record = self.video_list[index]

        if not self.test_mode:
            segment_indices = self._sample_indices(record) if self.random_shift else self._get_val_indices(record)
        else:
            segment_indices = self._get_test_indices(record)

        return self.get(record, segment_indices)

    # Load the sampled frames and apply the transforms (corner cropping, center cropping, etc.):
    # iterate over the sampled frame indices, load the image for each frame and append it to images;
    # then transform the images and return them together with the class label.
    def get(self, record, indices):

        images = list()
        for seg_ind in indices:
            p = int(seg_ind)
            for i in range(self.new_length):
                seg_imgs = self._load_image(record.path, p)
                images.extend(seg_imgs)
                if p < record.num_frames:
                    p += 1

        process_data = self.transform(images)
        return process_data, record.label

    # return the dataset length
    def __len__(self):
        return len(self.video_list)
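
A minimal usage sketch of TSNDataSet (the list-file path is a hypothetical placeholder, IdentityTransform from transforms.py stands in for the real transform pipeline that main.py passes, and the frame folders must exist on disk):

from dataset import TSNDataSet
from transforms import IdentityTransform

dataset = TSNDataSet("", 'ucf101_val_list.txt', num_segments=3, new_length=1,
                     modality='RGB', image_tmpl='img_{:05d}.jpg',
                     random_shift=False, transform=IdentityTransform())
imgs, label = dataset[0]       # a list of 3 PIL frames (one per segment) and the class label
print(len(imgs), label)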

VI. test_models.py walkthrough

1. Overall framework

First load the arguments, then build the network structure with the TSN class from models.py and load the pretrained model with torch.load(). After initializing the test-time parameters and setting up the GPU mode, call eval_video() to run the predictions, and finally aggregate the statistics.

2. Detailed code walkthrough

# entry point for testing the model

import argparse
import time

import numpy as np
import torch.nn.parallel
import torch.optim
from sklearn.metrics import confusion_matrix

from dataset import TSNDataSet
from models import TSN
from transforms import *
from ops import ConsensusModule

# options: load the arguments
parser = argparse.ArgumentParser(
    description="Standard video-level testing")
parser.add_argument('dataset', type=str, choices=['ucf101', 'hmdb51', 'kinetics'])
parser.add_argument('modality', type=str, choices=['RGB', 'Flow', 'RGBDiff'])
parser.add_argument('test_list', type=str)
parser.add_argument('weights', type=str)
parser.add_argument('--arch', type=str, default="resnet101")
parser.add_argument('--save_scores', type=str, default=None)
parser.add_argument('--test_segments', type=int, default=25)
parser.add_argument('--max_num', type=int, default=-1)
parser.add_argument('--test_crops', type=int, default=10)
parser.add_argument('--input_size', type=int, default=224)
parser.add_argument('--crop_fusion_type', type=str, default='avg',
                    choices=['avg', 'max', 'topk'])
parser.add_argument('--k', type=int, default=3)
parser.add_argument('--dropout', type=float, default=0.7)
parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                    help='number of data loading workers (default: 4)')
parser.add_argument('--gpus', nargs='+', type=int, default=None)
parser.add_argument('--flow_prefix', type=str, default='')

args = parser.parse_args()

# choose the number of classes according to the dataset
if args.dataset == 'ucf101':
    num_class = 101
elif args.dataset == 'hmdb51':
    num_class = 51
elif args.dataset == 'kinetics':
    num_class = 400
else:
    raise ValueError('Unknown dataset '+args.dataset)

# Build the network structure with the TSN class from models.py.
# To inspect the layers of the resulting net, use net.state_dict().
net = TSN(num_class, 1, args.modality,
          base_model=args.arch,
          consensus_type=args.crop_fusion_type,
          dropout=args.dropout)

# checkpoint = torch.load(args.weights) loads the pretrained model.
# In PyTorch, models are loaded with torch.load(); the input args.weights is the .pth file, i.e. the pretrained model.
checkpoint = torch.load(args.weights)
print("model epoch {} best prec@1: {}".format(checkpoint['epoch'], checkpoint['best_prec1']))

# read the layers and parameters of the pretrained model into the dict base_dict
base_dict = {'.'.join(k.split('.')[1:]): v for k,v in list(checkpoint['state_dict'].items())}

# net.load_state_dict(base_dict) calls torch.nn.Module's load_state_dict method to initialize net from the pretrained model
net.load_state_dict(base_dict)


# The args.test_crops conditional below applies one of two crop strategies: a plain crop or oversampled crops.
# With test_crops == 1, the image is first resized to a fixed size (e.g. from 400 to 256)
# and then center-cropped to net.input_size (e.g. 224).
# Note that one input image is still one image after these crop operations.
if args.test_crops == 1:
    cropping = torchvision.transforms.Compose([
        GroupScale(net.scale_size),
        GroupCenterCrop(net.input_size),
    ])
# If args.test_crops is 10, the GroupOverSample class in this project's transforms.py performs oversampled cropping, turning one image into 10 crops
elif args.test_crops == 10:
    cropping = torchvision.transforms.Compose([
        GroupOverSample(net.input_size, net.scale_size)
    ])
else:
    raise ValueError("Only 1 and 10 crops are supported while we got {}".format(args.test_crops))

# Load the data. test_segments defaults to 25 here, far more than during training;
# test_mode=True, so TSNDataSet's __getitem__ behaves differently than at training time
data_loader = torch.utils.data.DataLoader(
        TSNDataSet("", args.test_list, num_segments=args.test_segments,
                   new_length=1 if args.modality == "RGB" else 5,
                   modality=args.modality,
                   image_tmpl="img_{:05d}.jpg" if args.modality in ['RGB', 'RGBDiff'] else args.flow_prefix+"{}_{:05d}.jpg",
                   test_mode=True,
                   transform=torchvision.transforms.Compose([
                       cropping,
                       Stack(roll=args.arch == 'BNInception'),
                       ToTorchFormatTensor(div=args.arch != 'BNInception'),
                       GroupNormalize(net.input_mean, net.input_std),
                   ])),
        batch_size=1, shuffle=False,
        num_workers=args.workers * 2, pin_memory=True)

# set the GPU mode and initialize the data
if args.gpus is not None:
    devices = [args.gpus[i] for i in range(args.workers)]
else:
    devices = list(range(args.workers))


net = torch.nn.DataParallel(net.cuda(devices[0]), device_ids=devices)
net.eval()

data_gen = enumerate(data_loader)

total_num = len(data_loader.dataset)
output = []


# The core of testing: once the test data and the model are ready, this function runs the predictions
def eval_video(video_data):
    # video_data is a tuple: (i, data, label)
    i, data, label = video_data
    num_crop = args.test_crops

    if args.modality == 'RGB':
        length = 3
    elif args.modality == 'Flow':
        length = 10
    elif args.modality == 'RGBDiff':
        length = 18
    else:
        raise ValueError("Unknown modality "+args.modality)

    # data.view(-1, length, data.size(2), data.size(3)) reshapes the input from (1, 3*args.test_crops*args.test_segments, 224, 224)
    # to (args.test_crops*args.test_segments, 3, 224, 224), i.e. an effective batch size of args.test_crops*args.test_segments;
    # torch.autograd.Variable then wraps it as the model input
    input_var = torch.autograd.Variable(data.view(-1, length, data.size(2), data.size(3)),
                                        volatile=True)
    # net(input_var) returns a Variable;
    # to read the Tensor inside, use .data; cpu() moves it to the CPU, numpy() converts the Tensor to a numpy array, copy() makes a copy
    rst = net(input_var).data.cpu().numpy().copy()
    # rst.reshape((num_crop, args.test_segments, num_class)) reshapes the 2-D output to the given 3-D shape;
    # mean(axis=0) averages over the num_crop dimension, i.e. the predictions of the 10 crops of each frame are averaged into that frame's result;
    # a final reshape follows. Three values are returned: the video index, the prediction, and the video's true label
    return i, rst.reshape((num_crop, args.test_segments, num_class)).mean(axis=0).reshape(
        (args.test_segments, 1, num_class)
    ), label[0]


proc_start_time = time.time()
max_num = args.max_num if args.max_num > 0 else len(data_loader.dataset)

# Loop over the data; each iteration reads one video.
# The loop mainly calls eval_video; the predictions and true labels are all stored in the output list
for i, (data, label) in data_gen:
    if i >= max_num:
        break
    rst = eval_video((i, data, label))
    output.append(rst[1:])
    cnt_time = time.time() - proc_start_time
    print('video {} done, total {}/{}, average {} sec/video'.format(i, i+1,
                                                                    total_num,
                                                                    float(cnt_time) / (i+1)))

# Next compute the video-level predictions.
# np.mean(x[0], axis=0) shows that the scores of the args.test_segments frames are again averaged to get the video-level result,
# and np.argmax then takes the class with the highest probability as the video's predicted class
video_pred = [np.argmax(np.mean(x[0], axis=0)) for x in output]

# video_labels holds the true classes
video_labels = [x[1] for x in output]

# Build the confusion matrix (a numpy array).
# For example, with y_true=[2,0,2,2,0,1] and y_pred=[0,0,2,2,0,2], confusion_matrix(y_true, y_pred) is
# array([[2,0,0],[0,0,1],[1,0,2]]): each row is a true class, each column a predicted class
cf = confusion_matrix(video_labels, video_pred).astype(float)

# so cls_cnt = cf.sum(axis=1) is the number of videos of each true class
cls_cnt = cf.sum(axis=1)
# cls_hit = np.diag(cf) takes the diagonal of cf: how many videos of each class were predicted correctly
cls_hit = np.diag(cf)

# hence cls_acc = cls_hit / cls_cnt is the prediction accuracy for each class
cls_acc = cls_hit / cls_cnt

print(cls_acc)

# np.mean(cls_acc) is the mean per-class accuracy
print('Accuracy {:.02f}%'.format(np.mean(cls_acc) * 100))
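
The toy example in the comment above can be checked directly:

import numpy as np
from sklearn.metrics import confusion_matrix

cf = confusion_matrix([2, 0, 2, 2, 0, 1], [0, 0, 2, 2, 0, 2]).astype(float)
print(cf)                             # rows are true classes, columns are predicted classes
print(np.diag(cf) / cf.sum(axis=1))   # per-class accuracy: [1.  0.  0.667]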


# save the prediction scores to a file
if args.save_scores is not None:

    # reorder before saving
    name_list = [x.strip().split()[0] for x in open(args.test_list)]

    order_dict = {e:i for i, e in enumerate(sorted(name_list))}

    reorder_output = [None] * len(output)
    reorder_label = [None] * len(output)

    for i in range(len(output)):
        idx = order_dict[name_list[i]]
        reorder_output[idx] = output[i]
        reorder_label[idx] = video_labels[i]

    np.savez(args.save_scores, scores=reorder_output, labels=reorder_label)

VII. transforms.py walkthrough

No need to study this file in depth; it is enough to know what each class does.

import torchvision
import random
from PIL import Image, ImageOps
import numpy as np
import numbers
import math
import torch

# All of these are Group operations: transforming a whole group of frames at once saves time

# randomly crop a fixed-size patch: (x1, y1, th, tw)
class GroupRandomCrop(object):
    def __init__(self, size):
        if isinstance(size, numbers.Number):
            self.size = (int(size), int(size))
        else:
            self.size = size

    def __call__(self, img_group):

        w, h = img_group[0].size
        th, tw = self.size

        out_images = list()

        # top-left corner (x1, y1)
        x1 = random.randint(0, w - tw)
        y1 = random.randint(0, h - th)

        for img in img_group:
            assert(img.size[0] == w and img.size[1] == h)
            if w == tw and h == th:
                out_images.append(img)
            else:
                out_images.append(img.crop((x1, y1, x1 + tw, y1 + th)))

        return out_images


# crop a fixed-size patch from the center region
class GroupCenterCrop(object):
    def __init__(self, size):
        self.worker = torchvision.transforms.CenterCrop(size)

    def __call__(self, img_group):
        return [self.worker(img) for img in img_group]


# decide with probability 0.5 whether to flip horizontally
class GroupRandomHorizontalFlip(object):
    """Randomly horizontally flips the given PIL.Image with a probability of 0.5
    """
    def __init__(self, is_flow=False):
        self.is_flow = is_flow

    def __call__(self, img_group, is_flow=False):
        v = random.random()
        if v < 0.5:
            ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group]
            if self.is_flow:
                for i in range(0, len(ret), 2):
                    ret[i] = ImageOps.invert(ret[i])  # invert flow pixel values when flipping
            return ret
        else:
            return img_group


# GroupNormalize runs after ToTorchFormatTensor(); its input tensor has shape (N*C) x H x W
class GroupNormalize(object):
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, tensor):
        rep_mean = self.mean * (tensor.size()[0]//len(self.mean))
        rep_std = self.std * (tensor.size()[0]//len(self.std))

        # TODO: make efficient
        for t, m, s in zip(tensor, rep_mean, rep_std):
            t.sub_(m).div_(s)

        return tensor


# apply torchvision.transforms.Scale to all N input images, i.e. rescale them to the given size
class GroupScale(object):
    """ Rescales the input PIL.Image to the given 'size'.
    'size' will be the size of the smaller edge.
    For example, if height > width, then image will be
    rescaled to (size * height / width, size)
    size: size of the smaller edge
    interpolation: Default: PIL.Image.BILINEAR
    """

    def __init__(self, size, interpolation=Image.BILINEAR):
        self.worker = torchvision.transforms.Scale(size, interpolation)

    def __call__(self, img_group):
        return [self.worker(img) for img in img_group]


# Produce 10 crops per image: top-left, top-right, bottom-left, bottom-right and center, plus their 5 mirrored versions
class GroupOverSample(object):
    def __init__(self, crop_size, scale_size=None):
        self.crop_size = crop_size if not isinstance(crop_size, int) else (crop_size, crop_size)

        if scale_size is not None:
            self.scale_worker = GroupScale(scale_size)
        else:
            self.scale_worker = None

    def __call__(self, img_group):

        # resize the images in img_group to the fixed scale_size
        if self.scale_worker is not None:
            img_group = self.scale_worker(img_group)

        image_w, image_h = img_group[0].size
        crop_w, crop_h = self.crop_size

        # crop the 5 patches: top-left, top-right, bottom-left, bottom-right and center, at offsets (offset_w, offset_h)
        offsets = GroupMultiScaleCrop.fill_fix_offset(False, image_w, image_h, crop_w, crop_h)
        oversample_group = list()
        for o_w, o_h in offsets:
            normal_group = list()
            flip_group = list()
            for i, img in enumerate(img_group):
                crop = img.crop((o_w, o_h, o_w + crop_w, o_h + crop_h))
                normal_group.append(crop)
                # mirror each of the 5 crops
                flip_crop = crop.copy().transpose(Image.FLIP_LEFT_RIGHT)
                # handle the Flow case
                if img.mode == 'L' and i % 2 == 0:
                    flip_group.append(ImageOps.invert(flip_crop))
                else:
                    flip_group.append(flip_crop)

            oversample_group.extend(normal_group)
            oversample_group.extend(flip_group)
        # return the 10 crops: top-left, top-right, bottom-left, bottom-right and center, plus the 5 mirrored versions
        return oversample_group


# pipeline: __init__ -> _sample_crop_size() -> img.crop() -> img.resize() -> return ret_img_group
class GroupMultiScaleCrop(object):

    def __init__(self, input_size, scales=None, max_distort=1, fix_crop=True, more_fix_crop=True):
        self.scales = scales if scales is not None else [1, .875, .75, .66]
        self.max_distort = max_distort
        self.fix_crop = fix_crop
        self.more_fix_crop = more_fix_crop
        self.input_size = input_size if not isinstance(input_size, int) else [input_size, input_size]
        self.interpolation = Image.BILINEAR

    def __call__(self, img_group):

        im_size = img_group[0].size
        # the core: pick a single crop_w, crop_h, offset_w, offset_h
        crop_w, crop_h, offset_w, offset_h = self._sample_crop_size(im_size)
        crop_img_group = [img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h)) for img in img_group]
        ret_img_group = [img.resize((self.input_size[0], self.input_size[1]), self.interpolation)
                         for img in crop_img_group]
        return ret_img_group

    def _sample_crop_size(self, im_size):
        image_w, image_h = im_size[0], im_size[1]

        # find a crop size
        base_size = min(image_w, image_h)
        crop_sizes = [int(base_size * x) for x in self.scales]
        crop_h = [self.input_size[1] if abs(x - self.input_size[1]) < 3 else x for x in crop_sizes]
        crop_w = [self.input_size[0] if abs(x - self.input_size[0]) < 3 else x for x in crop_sizes]

        pairs = []
        for i, h in enumerate(crop_h):
            for j, w in enumerate(crop_w):
                if abs(i - j) <= self.max_distort:
                    pairs.append((w, h))

        crop_pair = random.choice(pairs)
        if not self.fix_crop:
            w_offset = random.randint(0, image_w - crop_pair[0])
            h_offset = random.randint(0, image_h - crop_pair[1])
        else:
            w_offset, h_offset = self._sample_fix_offset(image_w, image_h, crop_pair[0], crop_pair[1])

        return crop_pair[0], crop_pair[1], w_offset, h_offset

    def _sample_fix_offset(self, image_w, image_h, crop_w, crop_h):
        offsets = self.fill_fix_offset(self.more_fix_crop, image_w, image_h, crop_w, crop_h)
        return random.choice(offsets)

    @staticmethod
    def fill_fix_offset(more_fix_crop, image_w, image_h, crop_w, crop_h):
        w_step = (image_w - crop_w) // 4
        h_step = (image_h - crop_h) // 4

        ret = list()
        ret.append((0, 0))  # upper left
        ret.append((4 * w_step, 0))  # upper right
        ret.append((0, 4 * h_step))  # lower left
        ret.append((4 * w_step, 4 * h_step))  # lower right
        ret.append((2 * w_step, 2 * h_step))  # center

        if more_fix_crop:
            ret.append((0, 2 * h_step))  # center left
            ret.append((4 * w_step, 2 * h_step))  # center right
            ret.append((2 * w_step, 4 * h_step))  # lower center
            ret.append((2 * w_step, 0 * h_step))  # upper center

            ret.append((1 * w_step, 1 * h_step))  # upper left quarter
            ret.append((3 * w_step, 1 * h_step))  # upper right quarter
            ret.append((1 * w_step, 3 * h_step))  # lower left quarter
            ret.append((3 * w_step, 3 * h_step))  # lower right quarter

        return ret


class GroupRandomSizedCrop(object):
    """Random crop the given PIL.Image to a random size of (0.08 to 1.0) of the original size
    and a random aspect ratio of 3/4 to 4/3 of the original aspect ratio
    This is popularly used to train the Inception networks
    size: size of the smaller edge
    interpolation: Default: PIL.Image.BILINEAR
    """
    def __init__(self, size, interpolation=Image.BILINEAR):
        self.size = size
        self.interpolation = interpolation

    def __call__(self, img_group):
        for attempt in range(10):
            area = img_group[0].size[0] * img_group[0].size[1]
            target_area = random.uniform(0.08, 1.0) * area
            aspect_ratio = random.uniform(3. / 4, 4. / 3)

            w = int(round(math.sqrt(target_area * aspect_ratio)))
            h = int(round(math.sqrt(target_area / aspect_ratio)))

            if random.random() < 0.5:
                w, h = h, w

            if w <= img_group[0].size[0] and h <= img_group[0].size[1]:
                x1 = random.randint(0, img_group[0].size[0] - w)
                y1 = random.randint(0, img_group[0].size[1] - h)
                found = True
                break
        else:
            found = False
            x1 = 0
            y1 = 0

        if found:
            out_group = list()
            for img in img_group:
                img = img.crop((x1, y1, x1 + w, y1 + h))
                assert(img.size == (w, h))
                out_group.append(img.resize((self.size, self.size), self.interpolation))
            return out_group
        else:
            # Fallback
            scale = GroupScale(self.size, interpolation=self.interpolation)
            crop = GroupRandomCrop(self.size)
            return crop(scale(img_group))


# concatenate along the channel dimension
class Stack(object):

    def __init__(self, roll=False):
        self.roll = roll

    def __call__(self, img_group):
        if img_group[0].mode == 'L':
            return np.concatenate([np.expand_dims(x, 2) for x in img_group], axis=2)
        elif img_group[0].mode == 'RGB':
            # reverse the channel order (RGB -> BGR)
            if self.roll:
                return np.concatenate([np.array(x)[:, :, ::-1] for x in img_group], axis=2)
            else:
                return np.concatenate(img_group, axis=2)


# Accepts a PIL Image or a numpy.ndarray (OpenCV);
# transposes it from HWC to CHW, converts it to float, then divides each pixel by 255
class ToTorchFormatTensor(object):
    """ Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C) in the range [0, 255]
    to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] """
    def __init__(self, div=True):
        self.div = div

    def __call__(self, pic):
        if isinstance(pic, np.ndarray):
            # handle numpy array
            img = torch.from_numpy(pic).permute(2, 0, 1).contiguous()
        else:
            # handle PIL Image
            img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
            img = img.view(pic.size[1], pic.size[0], len(pic.mode))
            # put it from HWC to CHW format
            # yikes, this transpose takes 80% of the loading time/CPU
            img = img.transpose(0, 1).transpose(0, 2).contiguous()
        return img.float().div(255) if self.div else img.float()


# adding a __call__ method makes this a callable object
class IdentityTransform(object):

    def __call__(self, data):
        return data


if __name__ == "__main__":
    trans = torchvision.transforms.Compose([
        GroupScale(256),
        GroupRandomCrop(224),
        Stack(),
        ToTorchFormatTensor(),
        GroupNormalize(
            mean=[.485, .456, .406],
            std=[.229, .224, .225]
        )]
    )

    im = Image.open('../tensorflow-model-zoo.torch/lena_299.png')

    color_group = [im] * 3
    rst = trans(color_group)

    gray_group = [im.convert('L')] * 9
    gray_rst = trans(gray_group)

    trans2 = torchvision.transforms.Compose([
        GroupRandomSizedCrop(256),
        Stack(),
        ToTorchFormatTensor(),
        GroupNormalize(
            mean=[.485, .456, .406],
            std=[.229, .224, .225])
    ])
    print(trans2(color_group))

VIII. basic_ops.py walkthrough

No need to study this file in depth; it is enough to know what each class does.

import torch
import math


class Identity(torch.nn.Module):
    def forward(self, input):
        return input


class SegmentConsensus(torch.autograd.Function):

    def __init__(self, consensus_type, dim=1):
        self.consensus_type = consensus_type
        self.dim = dim
        self.shape = None

    def forward(self, input_tensor):
        self.shape = input_tensor.size()
        if self.consensus_type == 'avg':
            output = input_tensor.mean(dim=self.dim, keepdim=True)
        elif self.consensus_type == 'identity':
            output = input_tensor
        else:
            output = None

        return output

    def backward(self, grad_output):
        if self.consensus_type == 'avg':
            grad_in = grad_output.expand(self.shape) / float(self.shape[self.dim])
        elif self.consensus_type == 'identity':
            grad_in = grad_output
        else:
            grad_in = None

        return grad_in


class ConsensusModule(torch.nn.Module):

    def __init__(self, consensus_type, dim=1):
        super(ConsensusModule, self).__init__()
        self.consensus_type = consensus_type if consensus_type != 'rnn' else 'identity'
        self.dim = dim

    def forward(self, input):
        return SegmentConsensus(self.consensus_type, self.dim)(input)
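
A usage sketch of the average consensus (old-style Variables, matching the PyTorch version this code base targets):

from torch.autograd import Variable

scores = Variable(torch.randn(4, 3, 101))   # (batch, num_segments, num_class)
consensus = ConsensusModule('avg')
print(consensus(scores).size())             # torch.Size([4, 1, 101])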

IX. utils.py walkthrough

No need to study this file in depth; it is enough to know what each function does.

import torch
import numpy as np
from sklearn.metrics import confusion_matrix

# hook function
def get_grad_hook(name):
    def hook(m, grad_in, grad_out):
        print((name, grad_out[0].data.abs().mean(), grad_in[0].data.abs().mean()))
        print((grad_out[0].size()))
        print((grad_in[0].size()))

        print((grad_out[0]))
        print((grad_in[0]))

    return hook


# softmax definition
def softmax(scores):
    es = np.exp(scores - scores.max(axis=-1)[..., None])
    return es / es.sum(axis=-1)[..., None]
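
A quick numeric check of this (numerically stabilized) softmax:

x = np.array([[1.0, 2.0, 3.0]])
print(softmax(x))          # [[0.09003057 0.24472847 0.66524096]]
print(softmax(x).sum())    # 1.0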


# log_add definition
def log_add(log_a, log_b):
    return log_a + np.log(1 + np.exp(log_b - log_a))


# per-class accuracy computation
def class_accuracy(prediction, label):
    cf = confusion_matrix(prediction, label)
    cls_cnt = cf.sum(axis=1)
    cls_hit = np.diag(cf)

    cls_acc = cls_hit / cls_cnt.astype(float)

    mean_cls_acc = cls_acc.mean()

    return cls_acc, mean_cls_acc

Some rambling at the end

This is my first time reading a code base. After almost a week I am still in a fog: I did not know where to start, I do not know how to debug, and the Python I learned does not seem to apply, while everything the code uses is what I never learned. Taking it slowly; the goal for next week is to finish reading the code and to run the UCF-101 dataset on the workstation.
Last Sunday I watched Professor Randy Pausch's Last Lecture. What struck me most was what he said about how to face difficulties. I used to skim this kind of thing as chicken soup, but now that I have actually hit the kind of difficulty where you have energy but no idea where to apply it, I find it genuinely uplifting. Sharing it here for mutual encouragement.
That was a bit of a setback.
But remember, the brick walls are there for a reason.
The brick walls are not there to keep us out.
The brick walls are there to give us a chance to show how badly we want something.
Because the brick walls are there to stop the people who don’t want it badly enough.
They’re there to stop the other people.
Remember brick walls let us show our dedication.
They are there to separate us from the people who don’t really want to achieve their childhood dreams.
One more takeaway: for many things you do not know how to start or how to do them, but do not just hesitate and wander. Start doing it, however clumsily, and it will (probably) get better little by little. (Splitting the work into small tasks really does help with the fear of difficulty.)
