NaN training results with yolov5 on GTX 16XX-series graphics cards

autoanchor: Analyzing anchors... anchors/target = 4.27, Best Possible Recall (BPR) = 0.9935
Image sizes 640 train, 640 val
Using 1 dataloader workers
Logging results to runs\train\test42
Starting training for 3 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/2     1.86G       nan       nan       nan       113       640: 100%|██████████| 16/16 [00:23<00:00,  1.44s/it]
C:\Users\monst\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\optim\lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|| 8/8 [00:03<00:00,  2.45
                 all        128          0          0          0          0          0

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/2     2.45G       nan       nan       nan       128       640: 100%|██████████| 16/16 [00:17<00:00,  1.08s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|| 8/8 [00:03<00:00,  2.48
                 all        128          0          0          0          0          0

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/2     2.45G       nan       nan       nan       221       640: 100%|██████████| 16/16 [00:17<00:00,  1.09s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|| 8/8 [00:03<00:00,  2.39
                 all        128          0          0          0          0          0
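In the log above, the box, obj and cls losses are all nan from epoch 0 onward, and the validation metrics are all 0. As a minimal sketch (not part of yolov5; `first_non_finite` is a hypothetical helper name and the sample values are made up), a training loop could look for the first non-finite loss and stop there instead of running every epoch:

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN/inf loss value, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Illustrative values only, not taken from the log above.
print(first_non_finite([0.12, 0.11, float("nan"), 0.10]))
```

Stopping at the first bad step makes it quicker to tell whether an environment change actually fixed the problem.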

Found a solution:
https://github.com/ultralytics/yolov5/issues/4839


https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel-822


CUDA
https://developer.nvidia.cn/cuda-10.2-download-archive?target_os=Windows&target_arch=x86_64

cuDNN
https://developer.nvidia.com/rdp/cudnn-download

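Installing CUDA 10.2 did not fix it. I noticed other people were on PyTorch 1.9.1 while I was on 1.8.1, so I upgraded PyTorch through Anaconda, which automatically took it to 1.10.1.
After the upgrade, though, CUDA could no longer be found: torch.version.cuda showed None, meaning the torch and CUDA versions did not match.
I then noticed in Anaconda that the cudatoolkit package was still at 11.1 and had not been downgraded, so I downgraded it from 11.1 to 10.2. Still no luck.
In the end I gave up, deleted the old environment, and created a fresh one, this time with CUDA 10.2 and PyTorch 1.10.1.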
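To see quickly whether an environment is in the state described above (torch.version.cuda showing None), something like the following can be run inside it. This is a sketch rather than an official diagnostic; `describe_env` is a hypothetical helper name, and it only uses standard PyTorch attributes:

```python
def describe_env():
    """Collect the PyTorch/CUDA facts that matter when torch cannot find the GPU."""
    try:
        import torch
    except ImportError:
        return {"torch": None}  # PyTorch is not installed in this environment

    info = {
        "torch": torch.__version__,
        # None here means this torch build was not compiled with CUDA support.
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "cudnn": torch.backends.cudnn.version(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_env().items():
    print(f"{key}: {value}")
```

If `cuda_build` is None, changing cudatoolkit alone is unlikely to help, because the installed torch package itself has no CUDA support; that is consistent with only a clean reinstall working here.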

PyTorch
https://pytorch.org/
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2 -c pytorch
If the command you copied has a space on either side of each == (six spaces in total), remove them so that it matches the line above.
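After the install finishes, the pinned versions can be checked from Python. The sketch below makes a few assumptions: `check_pins` and `PINS` are hypothetical names, the pins are copied from the conda command above, and the conda package `pytorch` is looked up under its Python distribution name `torch`:

```python
from importlib import metadata

# Pins copied from the conda command; "torch" is the Python-side name of conda's "pytorch".
PINS = {"torch": "1.10.1", "torchvision": "0.11.2", "torchaudio": "0.10.1"}


def check_pins(pins):
    """Return {package: installed_version_or_None} for each package that misses its pin."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Ignore local build suffixes such as "+cu102" when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches


print(check_pins(PINS) or "all pins match")
```

An empty result means all three packages are installed at the pinned versions; it says nothing about whether CUDA is visible, which has to be checked separately.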

Installing PyTorch with Anaconda
https://blog.csdn.net/qq_45297730/article/details/121652951

All I can say is that a reinstall fixes everything.

More discussion about NaN:
https://github.com/ultralytics/yolov5/issues/4084
https://github.com/ultralytics/yolov5/issues/1625
https://github.com/ultralytics/yolov5/issues/1749
