Debugging PyTorch: RuntimeError: CUDA error: device-side assert triggered

Symptom

[2023-06-07 09:12:04]: 
-----------------FOLD: 1--------------------
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
 in 
     17 
     18     model = BERTEncoder(CFG.n_skill, embed_size=CFG.embed_dim)
---> 19     model.to(CFG.device)
     20 
     21     # optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.99, weight_decay=0.005)

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
    605             return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
    606 
--> 607         return self._apply(convert)
    608 
    609     def register_backward_hook(

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    352     def _apply(self, fn):
    353         for module in self.children():
--> 354             module._apply(fn)
    355 
    356         def compute_should_use_set_data(tensor, tensor_applied):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    374                 # `with torch.no_grad():`
    375                 with torch.no_grad():
--> 376                     param_applied = fn(param)
    377                 should_use_set_data = compute_should_use_set_data(param, param_applied)
    378                 if should_use_set_data:

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in convert(t)
    603             if convert_to_format is not None and t.dim() == 4:
    604                 return t.to(device, dtype if t.is_floating_point() else None, non_blocking, memory_format=convert_to_format)
--> 605             return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
    606 
    607         return self._apply(convert)

RuntimeError: CUDA error: device-side assert triggered

The traceback above is the full error message.

A `CUDA error: device-side assert triggered` is most often caused by an inconsistency between the labels and the number of output units, or by feeding the loss function inputs outside its expected range. To fix it, make sure the number of output units matches the number of classes, and that the output layer returns values in the range expected by the loss function (criterion) you chose.
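The class-count mismatch described above can be reproduced on CPU, where it raises a readable error instead of the opaque device-side assert. A minimal sketch with made-up shapes:

```python
import torch
import torch.nn as nn

# 3 output units, but a label of 3 slips in (valid labels are 0..2).
logits = torch.randn(4, 3)           # batch of 4, 3 classes
labels = torch.tensor([0, 1, 2, 3])  # 3 is out of bounds

criterion = nn.CrossEntropyLoss()
try:
    criterion(logits, labels)
except (IndexError, RuntimeError) as e:
    # On CPU this fails immediately with a clear message;
    # on GPU the same bug surfaces as "device-side assert triggered".
    print(type(e).__name__, e)
```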

Troubleshooting

  • First I tested with a plain torch.randint input and got the same error, which shows the problem is in the model itself rather than the data pipeline

  • Most troubleshooting advice online focuses on shape mismatches in the model:

    • the number of labels does not match the number of output units, or the wrong loss function is used
    • or, in NLP, the tokenizer emits ids that exceed the model's vocabulary size
  • In my case, however, the error fired as soon as I called to(device). Commenting out to(device) removed the error, but then the GPU cannot be used. (Device-side asserts are sticky: once one fires, the CUDA context is corrupted and every subsequent CUDA call, including .to(device), fails, so the line in the traceback is often not the real culprit.)

  • So when troubleshooting, first set the device to cpu:

model = BERTEncoder(CFG.n_skill, embed_size=CFG.embed_dim)
#model.to(CFG.device)  # keep the model on CPU while debugging

# test the model with dummy inputs on CPU
x = torch.randint(0, 10, (2, 99))#.to(CFG.device)
ids = torch.randint(0, 10, (2, 99))#.to(CFG.device)
y = model(x, ids)
print(y[0].size(), y[1].size())
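The other common root cause listed above, input ids exceeding the embedding table size, can likewise be reproduced and caught on CPU. A hypothetical standalone sketch (not the author's BERTEncoder):

```python
import torch
import torch.nn as nn

vocab_size = 10
emb = nn.Embedding(num_embeddings=vocab_size, embedding_dim=4)
ids = torch.tensor([[1, 5, 10]])  # 10 is out of range; valid ids are 0..9

# Sanity check worth running before moving anything to the GPU:
if ids.max().item() >= emb.num_embeddings:
    print(f"max id {ids.max().item()} >= vocab size {emb.num_embeddings}")

try:
    emb(ids)
except IndexError as e:
    # On GPU this would be a device-side assert; on CPU it is readable.
    print("IndexError:", e)
```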

On your own machine, run the script with synchronous kernel launches so the traceback points at the kernel that actually failed:

CUDA_LAUNCH_BLOCKING=1 python script.py args
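In a notebook, where you cannot prefix the launch command, the same flag can be set from Python; it must be set before the first CUDA call (ideally before importing torch) to take effect:

```python
import os

# Force synchronous kernel launches so the error is raised at the
# failing operation instead of at a later, unrelated CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import only after setting the variable
```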

Alternatively, restarting the runtime (e.g. a Kaggle notebook) clears the corrupted CUDA context and makes the error go away, until the underlying bug triggers it again.

[https://discuss.pytorch.org/t/how-to-fix-cuda-error-device-side-assert-triggered-error/137553](https://discuss.pytorch.org/t/how-to-fix-cuda-error-device-side-assert-triggered-error/137553)
