[Linux Multi-Machine Multi-GPU Training, Step 4] Errors During Training

Error 1

If the following error appears during training:
[2023/06/05 10:56:42] ppcls INFO: [Train][Epoch 1/200][Iter: 0/151]lr(CosineAnnealingDecay): 0.00100000, top1: 0.03125, top5: 1.00000, CELoss: 1.79536, loss: 1.79536, batch_cost: 35.73629s, reader_cost: 0.38432, ips: 0.89545 samples/s, eta: 12 days, 11:47:15
[2023/06/05 11:31:35] ppcls INFO: [Train][Epoch 1/200][Iter: 100/151]lr(CosineAnnealingDecay): 0.00099997, top1: 0.69431, top5: 1.00000, CELoss: 1.07162, loss: 1.07162, batch_cost: 20.96136s, reader_cost: 0.00091, ips: 1.52662 samples/s, eta: 7 days, 7:15:36
[2023/06/05 11:48:19] ppcls INFO: [Train][Epoch 1/200][Avg]top1: 0.71164, top5: 1.00000, CELoss: 1.02763, loss: 1.02763
Traceback (most recent call last):
  File "tools/train.py", line 35, in <module>
    engine.train()
  File "/media/uvtec/WORK/PaddleClas/ppcls/engine/engine.py", line 372, in train
    acc = self.eval(epoch_id)
  File "/home/uvtec/miniconda3/envs/paddle_clas_muti/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/uvtec/miniconda3/envs/paddle_clas_muti/lib/python3.8/site-packages/paddle/fluid/dygraph/base.py", line 375, in _decorate_function
    return func(*args, **kwargs)
  File "/media/uvtec/WORK/PaddleClas/ppcls/engine/engine.py", line 451, in eval
    eval_result = self.eval_func(self, epoch_id)
  File "/media/uvtec/WORK/PaddleClas/ppcls/engine/evaluation/classification.py", line 99, in classification_eval
    paddle.distributed.all_gather(pred_list, out)
  File "/home/uvtec/miniconda3/envs/paddle_clas_muti/lib/python3.8/site-packages/paddle/distributed/collective.py", line 953, in all_gather
    task = group.process_group.all_gather(tensor, out)
OSError: (External) NCCL error(5), invalid usage. Detail: Resource temporarily unavailable
Please try one of the following solutions:

  1. export NCCL_SHM_DISABLE=1;
  2. export NCCL_P2P_LEVEL=SYS;
  3. Increase shared memory by setting the -shm-size option when starting docker container, e.g., setting -shm-size=2g.

[Hint: Please search for the error code(5) on website (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#ncclresult-t) to get Nvidia's official solution and advice about NCCL Error.] (at /paddle/paddle/fluid/platform/device/gpu/nccl_helper.h:117)
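Before picking one of the workarounds, it can help to see where NCCL actually fails. A minimal sketch using NCCL's standard debug environment variables (the variable names are real NCCL settings; enabling them before relaunching training is my suggestion, not part of the original error hint):

```shell
# Turn on NCCL's own logging before relaunching training.
# NCCL_DEBUG=INFO prints transport setup and the first failing call.
export NCCL_DEBUG=INFO
# Optional and much more verbose: restrict logging to subsystems
# such as initialization and networking.
export NCCL_DEBUG_SUBSYS=INIT,NET
echo "NCCL_DEBUG=$NCCL_DEBUG"
```

Rerun the training command afterwards and check whether the log points at shared memory (SHM) or peer-to-peer transport, which tells you which of the three suggested fixes is most likely to apply.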

Solution 1

Follow one of the solutions suggested in the hint.
On both the master and the worker machines, run the following in the command line:

export NCCL_SHM_DISABLE=1
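Note that `export` only affects the current shell session, so the variable must be set on every node in the shell that launches training. A sketch of the full sequence (the launch command, IP list, and config path are illustrative placeholders, not from the original post):

```shell
# Run on BOTH the master and the worker node, in the same shell
# that will start the training job:
export NCCL_SHM_DISABLE=1   # disable NCCL's shared-memory transport

# Then relaunch distributed training, e.g. (illustrative arguments):
# python -m paddle.distributed.launch --ips="192.168.1.10,192.168.1.11" \
#     tools/train.py -c your_config.yaml
echo "NCCL_SHM_DISABLE=$NCCL_SHM_DISABLE"
```

If the error persists, try the other hinted options (`export NCCL_P2P_LEVEL=SYS`, or a larger `--shm-size` when running inside Docker) in the same way.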
