Advanced usage of load_state_dict() for neural networks

We often need to load a pretrained model before training. In the simple case, model.load_state_dict(torch.load(state_path)) is all it takes, but sometimes not every parameter in the pretrained model lines up with a parameter in the model we actually want to train.

There are three mismatch cases:

1. A parameter exists in the pretrained model but not in the target model.
2. A parameter exists in the target model but not in the pretrained model.
3. A parameter exists in both models, but the values do not correspond (for example, the shapes differ).

The first two cases do not prevent loading: simply set strict to False, and load_state_dict() will return missing_keys and unexpected_keys, telling you which parameters were missing or unexpected.

However, when a parameter name matches but the tensor shapes do not, loading fails outright. In that case we can delete the offending entries from the state dict with del state_dict[wrong_key] before loading, as the sketch below demonstrates.


missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=strict)
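
A minimal, self-contained sketch of this workflow with two toy models (purely illustrative, not from the original code): the key names match but the shapes differ, so the mismatched entries are dropped before loading.

import torch.nn as nn

# toy models for illustration: both have parameters named '0.weight' and
# '0.bias', but the output dimensions differ (10 vs. 20)
src_model = nn.Sequential(nn.Linear(4, 10))
dst_model = nn.Sequential(nn.Linear(4, 20))

state_dict = src_model.state_dict()
target_sd = dst_model.state_dict()
# even with strict=False, a shape mismatch raises a RuntimeError,
# so the mismatched keys have to be removed first
for k in list(state_dict.keys()):
    if k in target_sd and state_dict[k].size() != target_sd[k].size():
        del state_dict[k]

missing_keys, unexpected_keys = dst_model.load_state_dict(state_dict, strict=False)
print(missing_keys)     # ['0.weight', '0.bias']
print(unexpected_keys)  # []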

The code for this step is shown below; map_location is used to redirect the device onto which the stored tensors are loaded.

    device = torch.cuda.current_device()
    state_dict = torch.load(checkpoint, map_location=lambda storage, loc: storage.cuda(device))
    src_state_dict = state_dict['net']
    target_state_dict = model.state_dict()
    skip_keys = []
    # skip tensors whose sizes do not match, e.g. when fine-tuning from a pretrained model
    for k in src_state_dict.keys():
        if k not in target_state_dict:
            continue
        if src_state_dict[k].size() != target_state_dict[k].size():
            skip_keys.append(k)
    for k in skip_keys:
        del src_state_dict[k]
    missing_keys, unexpected_keys = model.load_state_dict(src_state_dict, strict=strict)
    if skip_keys:
        logger.info(
            f'removed keys in source state_dict due to size mismatch: {", ".join(skip_keys)}')
    if missing_keys:
        logger.info(f'missing keys in source state_dict: {", ".join(missing_keys)}')
    if unexpected_keys:
        logger.info(f'unexpected keys in source state_dict: {", ".join(unexpected_keys)}')
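
As an aside, map_location does not have to be a lambda; torch.load also accepts a device string or torch.device, which is handy when loading on a machine without a GPU:

state_dict = torch.load(checkpoint, map_location='cpu')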

In practice, it is best to also save the optimizer's state during training, so that training can resume even after an interruption: an optimizer's internal parameters (momentum buffers, adaptive learning-rate statistics, and so on) are adjusted automatically as training proceeds, so preserving them matters. It is also worth saving the epoch number, so we know how many epochs have already been trained.

    # load optimizer
    if optimizer is not None:
        assert 'optimizer' in state_dict
        optimizer.load_state_dict(state_dict['optimizer'])

    if 'epoch' in state_dict:
        epoch = state_dict['epoch']
    else:
        epoch = 0
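
To see what is at stake, here is a small sketch (model is assumed to be any nn.Module) of what an Adam optimizer's state dict contains:

import torch

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sd = opt.state_dict()
# sd['param_groups'] holds hyperparameters such as lr, betas and weight_decay;
# sd['state'] holds the per-parameter buffers (Adam's step count, exp_avg and
# exp_avg_sq) that would be lost if training restarted from the weights alone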

The complete loading code:

import torch

def load_checkpoint(checkpoint, logger, model, optimizer=None, strict=False):
    """Load model weights (and optionally the optimizer state) from a
    checkpoint file and return the epoch to resume training from."""
    device = torch.cuda.current_device()
    state_dict = torch.load(checkpoint, map_location=lambda storage, loc: storage.cuda(device))
    src_state_dict = state_dict['net']
    target_state_dict = model.state_dict()
    skip_keys = []
    # skip tensors whose sizes do not match, e.g. when fine-tuning from a pretrained model
    for k in src_state_dict.keys():
        if k not in target_state_dict:
            continue
        if src_state_dict[k].size() != target_state_dict[k].size():
            skip_keys.append(k)
    for k in skip_keys:
        del src_state_dict[k]
    missing_keys, unexpected_keys = model.load_state_dict(src_state_dict, strict=strict)
    if skip_keys:
        logger.info(
            f'removed keys in source state_dict due to size mismatch: {", ".join(skip_keys)}')
    if missing_keys:
        logger.info(f'missing keys in source state_dict: {", ".join(missing_keys)}')
    if unexpected_keys:
        logger.info(f'unexpected keys in source state_dict: {", ".join(unexpected_keys)}')

    # load optimizer
    if optimizer is not None:
        assert 'optimizer' in state_dict
        optimizer.load_state_dict(state_dict['optimizer'])

    if 'epoch' in state_dict:
        epoch = state_dict['epoch']
    else:
        epoch = 0
    return epoch + 1
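
A typical way to wire this into a training loop (a sketch; work_dir, max_epochs and train_one_epoch are assumed to be defined elsewhere, and checkpoint_save is the saving function shown below):

import os

start_epoch = 1
latest = os.path.join(work_dir, 'latest.pth')
if os.path.isfile(latest):
    start_epoch = load_checkpoint(latest, logger, model, optimizer)
for epoch in range(start_epoch, max_epochs + 1):
    train_one_epoch(model, optimizer)  # hypothetical per-epoch training step
    checkpoint_save(epoch, model, optimizer, work_dir)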

For completeness, here is the matching code for saving checkpoints during training. It relies on two small helpers, is_power2 and is_multiple, defined at the end.

import os
import os.path as osp

def checkpoint_save(epoch, model, optimizer, work_dir, save_freq=16):
    f = os.path.join(work_dir, f'epoch_{epoch}.pth')
    checkpoint = {
        'net': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'epoch': epoch
    }
    torch.save(checkpoint, f)
    # if a latest.pth already exists, remove it, then use ln -s to point
    # latest.pth at the new epoch_xxx.pth
    if os.path.exists(f'{work_dir}/latest.pth'):
        os.remove(f'{work_dir}/latest.pth')
    os.system(f'cd {work_dir}; ln -s {osp.basename(f)} latest.pth')

    # remove the previous epoch's .pth file unless its epoch number is a
    # power of 2 or a multiple of save_freq, to avoid keeping too many files
    epoch = epoch - 1
    f = os.path.join(work_dir, f'epoch_{epoch}.pth')
    if os.path.isfile(f):
        if not is_multiple(epoch, save_freq) and not is_power2(epoch):
            os.remove(f)
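
Note that the os.system call above relies on the Unix ln command. A portable alternative (a sketch, not part of the original code) creates the relative symlink with os.symlink; on Windows this may require extra privileges:

latest = os.path.join(work_dir, 'latest.pth')
if os.path.islink(latest) or os.path.exists(latest):
    os.remove(latest)
os.symlink(os.path.basename(f), latest)  # relative link, like ln -s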

def is_power2(num):
    # a positive integer is a power of two iff exactly one bit is set
    return num != 0 and ((num & (num - 1)) == 0)


def is_multiple(num, multiple):
    return num != 0 and num % multiple == 0
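
With the default save_freq=16, the checkpoints that survive the cleanup are the power-of-two epochs plus every 16th epoch (in addition to the most recent one, which is always kept until the next save). A quick check:

kept = [e for e in range(1, 65) if is_power2(e) or is_multiple(e, 16)]
print(kept)  # [1, 2, 4, 8, 16, 32, 48, 64]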
