My PyTorch training run was going fine, then it crashed:
Traceback (most recent call last):
  File "main_multi_model_test.py", line 147, in <module>
    main()
  File "main_multi_model_test.py", line 119, in main
    train_loss, train_acc, train_bacc = train(model, optimizer, train_loader, criterions, taskNum, )
  File "../train.py", line 65, in train
    optimizer.step()
  File "/home/user1/miniconda3/lib/python3.7/site-packages/torch/optim/adam.py", line 93, in step
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: CUDA error: an illegal instruction was encountered
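One thing worth keeping in mind when reading this traceback: CUDA kernels are launched asynchronously, so the Python line that raises the error (here optimizer.step()) is not necessarily the operation that actually failed. A minimal sketch of forcing synchronous launches so the traceback points at the real call is below. The helper name enable_blocking_cuda_launches is made up for illustration; CUDA_LAUNCH_BLOCKING is the standard CUDA environment variable, and it needs to be set before CUDA is initialized.

```python
import os


def enable_blocking_cuda_launches(env=os.environ):
    """Ask CUDA to run kernels synchronously so errors surface at the failing call.

    Hypothetical helper for illustration; it only sets an environment variable.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


# Call this at the very top of the entry script, before `import torch`,
# so the setting is in place when the CUDA context is created.
enable_blocking_cuda_launches()
```

Setting CUDA_LAUNCH_BLOCKING=1 in the shell before launching main_multi_model_test.py with the usual arguments has the same effect. Training will run slower in this mode, so it is only meant for debugging.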
Here is the training code:
def train(model, optimizer, train_loader, criterions, taskNum):
    model.train()
    # taskNum = 40
    taskAttrNum = 40 // taskNum  # number of attributes predicted in one task
    # Per-task running totals: loss, correct count, and the tn/tp/Nn/Np
    # counts that list_bacc uses for balanced accuracy.
    train_loss, corrects, tns, tps, Nns, Nps = [0] * taskNum, [0] * taskNum, [0] * taskNum, [0] * taskNum, [0] * taskNum, [0] * taskNum
    samplesNum = 0
    for i, (inputs, labels) in enumerate(train_loader):
        inputs = inputs.cuda(non_blocking=True)
        labels = labels.cuda(non_blocking=True)
        outputs = model(inputs)
        batch_size = labels.size(0)
        # acc_bacc returns one entry per task in each of these lists.
        loss, correct, tn, tp, Nn, Np = acc_bacc(criterions, outputs, labels, taskNum)
        loss_sum = sum(loss)
        optimizer.zero_grad()
        loss_sum.backward()
        optimizer.step()
        # Accumulate this batch's per-task statistics.
        train_loss = [train_loss[t] + loss[t] for t in range(taskNum)]
        corrects = [corrects[t] + correct[t] for t in range(taskNum)]
        tns = [tns[t] + tn[t] for t in range(taskNum)]
        tps = [tps[t] + tp[t] for t in range(taskNum)]
        Nns = [Nns[t] + Nn[t] for t in range(taskNum)]
        Nps = [Nps[t] + Np[t] for t in range(taskNum)]
        samplesNum += batch_size
    # Per-task accuracy (in percent) and balanced accuracy over the epoch.
    acc = [100 * corrects[t] / (samplesNum * taskAttrNum) for t in range(taskNum)]
    bacc = list_bacc(tns, tps, Nns, Nps)
    assert len(train_loss) == len(acc) == len(bacc) == taskNum
    loss_avg = sum(train_loss) / (samplesNum * taskNum)
    acc_avg = sum(acc) / len(acc)
    bacc_avg = sum(bacc) / len(bacc)
    return loss_avg, acc_avg, bacc_avg
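A side note on the accumulation in the loop above, unrelated to the crash itself: if the entries of loss are scalar tensors (an assumption, since acc_bacc is not shown), adding them to train_loss directly keeps tensor objects alive across batches. A rough sketch of accumulating plain Python floats instead is below; add_per_task is a hypothetical helper, not part of the original code.

```python
def add_per_task(totals, values):
    """Return per-task running totals as plain floats.

    float() accepts Python numbers and, assuming PyTorch is in use,
    single-element tensors, so nothing tensor-backed is kept around.
    """
    if len(totals) != len(values):
        raise ValueError("totals and values must have one entry per task")
    return [total + float(value) for total, value in zip(totals, values)]


# Example with plain numbers standing in for per-task batch losses.
running = [0.0, 0.0]
running = add_per_task(running, [1.5, 2])
running = add_per_task(running, [0.5, 1])
```

In the training loop this would read train_loss = add_per_task(train_loss, loss). Note that converting a CUDA tensor to a float waits for the GPU to finish, and that train would then return loss_avg as a float rather than a tensor.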
The training settings were:
python3.7
-lr 0.0001
-gpu 2
Model:Arcface, TaskNum:40, Bits/Attr:1
ir_se_50 model generated
Loaded weights /home/user1/Downloads/model_ir_se50.pth
Use CelebA+Lfw-a train set weights
Using image size: 112
Using Mixed CelebA train + 80% Lfw-a train set
Using Mixed CelebA Val + 20% Lfw-a Train as Validation set
Save in ckpt/0716114937_Arcface_t40_bs128_lr0.0001
Strangely, the same code did not crash when run with other parameter settings…
Possible solutions:
1, Switch to a different machine
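If switching machines does make the error go away, it helps to know what actually differs between the two. The following is a small sketch, not a definitive diagnostic, that records the Python, PyTorch, CUDA build, cuDNN, and GPU details so the reports from both machines can be compared. The function name collect_env_report and the dictionary keys are made up for illustration; the torch calls are standard PyTorch APIs.

```python
import platform


def collect_env_report():
    """Gather version and GPU details worth comparing across machines."""
    report = {"python": platform.python_version(), "torch": None, "gpus": []}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed here; report the Python version only.
        return report
    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cudnn"] = torch.backends.cudnn.version()
    if torch.cuda.is_available():
        report["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Running it on the machine that crashes and on the one that does not, then diffing the output, gives a starting point for telling a hardware or driver difference apart from a software version difference.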