PyTorch GradScaler(): UserWarning: Detected call of lr_scheduler.step() before optimizer.step()

Problem:

Using GradScaler together with an lr_scheduler in PyTorch produces the following UserWarning:

Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.

In recent versions of PyTorch (1.1.0 and later), the parameter update optimizer.step() should be called before the learning-rate adjustment lr_scheduler.step().
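
For reference, this is the ordering PyTorch expects in an ordinary (non-AMP) training loop. A minimal sketch, assuming model, criterion, train_dl and epochs are defined elsewhere:

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(epochs):
    for x, y in train_dl:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()   # update the parameters first ...
    scheduler.step()       # ... then adjust the learning rate (here once per epoch)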

However, as the example code below shows, the warning still appears after introducing GradScaler, even though scaler.step(optimizer) is already placed before scheduler.step().

steps = len(train_dl) * epochs
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=lr, steps_per_epoch=len(train_dl), epochs=epochs)
avg_train_losses = []
avg_val_losses = []
avg_val_scores = []
lr = []  # per-step learning-rate log (note: rebinds the scalar lr used above)
best_avg_val_score = -1000
scaler = torch.cuda.amp.GradScaler()  # mixed-precision gradient scaler

for epoch in tqdm(range(epochs), total=epochs):
    model.train()
    total_train_loss = 0.0

    for i, (x, y, image_tensor) in enumerate(train_dl):
        x, y, image_tensor = move_to_dev(x, y, image_tensor)
        model.zero_grad()
        with torch.cuda.amp.autocast():
            output = model(x, image_tensor)
            loss = criterion(y, output)
        total_train_loss += loss.item()
        
        # backward pass and optimization
        scaler.scale(loss).backward()
        ###
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        ###
        lr.append(get_lr(optimizer))

Possible causes:

1. If the gradients turn out to be NaN/inf on the first iteration (for example, because the initial scale factor is too high and the scaled gradients overflow), scaler.step(optimizer) skips the parameter update and execution falls straight through to scheduler.step(). So the warning can be raised even though optimizer.step() is already written before scheduler.step().

2. The learning rate produced by torch.optim.lr_scheduler.OneCycleLR() follows the curve shown in the figure below: it starts out very small (see the short sketch after the figure). Although such a small learning rate should make gradient overflow less likely, NaN/inf gradients can still occur in these early iterations, which again makes scaler.step() skip the parameter update optimizer.step() and fall straight through to scheduler.step().

[Figure: OneCycleLR learning rate curve, warming up from a very small initial learning rate to max_lr and then decaying]
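
To see this warm-up phase concretely, you can step a throwaway OneCycleLR instance a few times and print the learning rates it produces. A small sketch, independent of the training code above:

import torch

params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=0.1)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=0.1, steps_per_epoch=100, epochs=10)

print(sched.get_last_lr())  # initial lr = max_lr / div_factor = 0.1 / 25 = 0.004
for step in range(5):
    opt.step()              # optimizer first, so this sketch does not trigger the warning itself
    sched.step()
    print(step, sched.get_last_lr())  # ramps up very slowly toward max_lr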

In short, whenever the gradients contain NaN/inf, optimizer.step() is skipped and execution goes straight to lr_scheduler.step(). PyTorch then sees the learning rate being adjusted without the parameters having been updated, so the warning can appear even though optimizer.step() is written before lr_scheduler.step().
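
To confirm that this is what is happening in your own run, you can unscale the gradients manually before scaler.step() and check them for inf/NaN. A diagnostic sketch meant to drop into the training loop above (loss, model, optimizer and scaler come from that code):

scaler.scale(loss).backward()

# unscale_() divides the gradients by the current scale factor in place;
# scaler.step() notices this and will not unscale them a second time
scaler.unscale_(optimizer)
found_bad_grad = any(
    not torch.isfinite(p.grad).all()
    for p in model.parameters() if p.grad is not None
)
if found_bad_grad:
    print("inf/NaN gradients -> scaler.step() will skip optimizer.step()")

scaler.step(optimizer)
scaler.update()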

Solution: modify the code as follows.

steps = len(train_dl) * epochs
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=lr, steps_per_epoch=len(train_dl), epochs=epochs)
avg_train_losses = []
avg_val_losses = []
avg_val_scores = []
lr = []
best_avg_val_score = -1000
scaler = torch.cuda.amp.GradScaler()  # mixed-precision gradient scaler

for epoch in tqdm(range(epochs), total=epochs):
    model.train()
    total_train_loss = 0.0

    for i, (x, y, image_tensor) in enumerate(train_dl):
        x, y, image_tensor = move_to_dev(x, y, image_tensor)
        model.zero_grad()
        with torch.cuda.amp.autocast():
            output = model(x, image_tensor)
            loss = criterion(y, output)
        total_train_loss += loss.item()
        
        # backward pass and optimization
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        ###
        scale = scaler.get_scale()          # scale factor before update()
        scaler.update()
        # if the scale factor dropped, optimizer.step() was skipped this iteration
        skip_lr_sched = (scale > scaler.get_scale())
        if not skip_lr_sched:
            scheduler.step()
        ###
        lr.append(get_lr(optimizer))

It is recommended to use
skip_lr_sched = (scale > scaler.get_scale())
rather than
skip_lr_sched = (scale != scaler.get_scale())

because, according to the docs, scaler.update() decreases the scale factor whenever optimizer.step() was skipped, and only increases it after a run of consecutive non-skipped steps (growth_interval of them, 2000 by default).

Simply checking scale != scaler.get_scale() can also be True right after the scale factor has been increased (i.e. when optimizer.step() has NOT been skipped), which would make us skip scheduler.step() when we should not.
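
The asymmetry comes from GradScaler's defaults: the scale factor is multiplied by backoff_factor (0.5) immediately whenever a step is skipped, but multiplied by growth_factor (2.0) only after growth_interval (2000) consecutive non-skipped steps. A small sketch that forces one skipped step, assuming a CUDA device is available:

import torch

p = torch.nn.Parameter(torch.zeros(1, device="cuda"))
opt = torch.optim.SGD([p], lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # defaults: init_scale=65536, backoff_factor=0.5

loss = (p * float("inf")).sum()       # produces an inf gradient on purpose
scaler.scale(loss).backward()

before = scaler.get_scale()
scaler.step(opt)                      # skipped: the gradients are not finite
scaler.update()                       # scale factor is halved
print(before, scaler.get_scale())     # 65536.0 32768.0
print(before > scaler.get_scale())    # True, so this iteration should skip scheduler.step()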

Reference:

https://discuss.pytorch.org/t/optimizer-step-before-lr-scheduler-step-error-using-gradscaler/92930/5
