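https://blog.csdn.net/didiaopao/category_11321656.html?spm=1001.2014.3001.5482
Background: a mismatch between the CUDA version and the GPU caused errors during model training.
Note: this post mainly records the problems I ran into while training the model.
The commands for training the model and evaluating it on the validation set are shown below; they are the key that opens the door to training.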
python train.py --data mask_data.yaml --cfg mask_yolov5s.yaml --weights pretrained/yolov5s.pt --epoch 50 --batch-size 10
python val.py --data data/mask_data.yaml --weights runs/train/exp35/weights/best.pt --img 640
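Note that these commands have to be run from the terminal itself, so exit the current Python interpreter first.
To exit, type the following at the Python prompt: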
quit()
Problems that appeared during training:
Problem on the training set
     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/0     2.61G       nan       nan       nan        19       320:  57%|█████▋    | 123/215 [01:42<01:16,  1.20it/s]
Problem on the validation set
     48/49     2.62G       nan       nan       nan         4       736: 100%|██████████| 215/215 [02:48<00:00,  1.28it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 29/29 [00:12<00:00,  2.34it/s]
                 all        346          0          0          0          0          0
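This problem comes from an incompatibility between the GTX 1650 GPU and the cuDNN version bundled with torch. The losses are nan from the very first epoch, so the model was never actually trained, and there is no point in talking about validation results.
Solution: reinstall matching versions of torch and CUDA.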
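The first visible symptom is the nan in the box/obj/cls columns, so it can save time to check for non-finite losses early instead of waiting through all 50 epochs. Below is a minimal sketch of such a check. The helper name `losses_are_finite` is hypothetical and not part of YOLOv5; it only illustrates the test one might apply to the loss values.

```python
import math


def losses_are_finite(losses):
    """Return True only if every loss value is a finite number (not nan or inf)."""
    return all(math.isfinite(float(v)) for v in losses)


# Values in the style of the box/obj/cls columns of the training log.
print(losses_are_finite([0.05, 0.02, 0.01]))   # True
print(losses_are_finite([float("nan")] * 3))   # False
```

If a check like this fails in the first few batches, the environment is the more likely culprit than the dataset or the hyperparameters, which matches what happened here.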
How the problem was solved:
1. I tried the following command
pip3 install torch==1.10.2+cu102 torchvision==0.11.3+cu102 torchaudio===0.10.2+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html
After that, the following error appeared
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'e:\\graduation\\miniconda\\miniconda\\envs\\mask\\lib\\site-packages\\numpy-1.22.3.dist-info\\METADATA'
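Fix attempt 1: create a folder named METADATA inside the folder at that path (did not work).
Fix attempt 2: copy the contents of the other folder at that location, numpy-1.21.5.dist-info, into the numpy-1.22.3.dist-info folder. This solved it completely.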
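To see whether other packages in the environment have the same problem, one can scan site-packages for `*.dist-info` directories that have no METADATA file. This is only a sketch: the function name `find_broken_dist_info` is made up for illustration, and it assumes the active environment's site-packages is the path that `sysconfig` reports as "purelib".

```python
from pathlib import Path
import sysconfig


def find_broken_dist_info(site_packages):
    """Return the names of *.dist-info directories that lack a METADATA file."""
    root = Path(site_packages)
    return sorted(
        d.name
        for d in root.glob("*.dist-info")
        if d.is_dir() and not (d / "METADATA").is_file()
    )


if __name__ == "__main__":
    # "purelib" is the site-packages directory of the running interpreter.
    print(find_broken_dist_info(sysconfig.get_paths()["purelib"]))
```

The check uses `is_file()`, so a folder named METADATA still counts as broken, which is consistent with fix attempt 1 not working.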
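2. Now check which CUDA version the GPU supports
(Afterwards I also downloaded and installed the latest GPU driver, although this step turned out to make no difference.)
Check the GPU driver version (before the update) with the following command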
nvidia-smi
C:\Users\kx>nvidia-smi
Fri Apr 15 19:54:00 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 497.33       Driver Version: 497.33       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   35C    P8     3W /  N/A |    134MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4480    C+G   ...bbwe\PaintStudio.View.exe    N/A      |
+-----------------------------------------------------------------------------+
After updating the driver:
C:\Users\kx>nvidia-smi
Fri Apr 15 20:00:16 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 512.15       Driver Version: 512.15       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   40C    P8     6W /  N/A |      0MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
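The "CUDA Version" in the nvidia-smi banner is the highest CUDA version the installed driver supports, not the CUDA version torch was built with. Both torch builds discussed in this post (11.3 and 10.2) are below 11.5, which is consistent with the driver update making no difference. The sketch below pulls the two version numbers out of the banner line; `parse_smi_header` is a hypothetical helper name, and the regular expression assumes the banner layout shown above.

```python
import re


def parse_smi_header(text):
    """Extract (driver_version, cuda_version) from nvidia-smi output, or None."""
    m = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text)
    return (m.group(1), m.group(2)) if m else None


banner = "| NVIDIA-SMI 512.15       Driver Version: 512.15       CUDA Version: 11.6     |"
print(parse_smi_header(banner))  # ('512.15', '11.6')
```

On a machine with the driver installed, the text can be obtained with `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` and passed to the same function.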
Check the CUDA version (before installing the new GPU driver)
>>> import torch
>>> print(torch.cuda.is_available())
True
>>> print(torch.backends.cudnn.is_available())
True
>>> print(torch.cuda_version)
11.3
>>> print(torch.backends.cudnn.version())
8200
import torch
print(torch.cuda.is_available())
print(torch.backends.cudnn.is_available())
print(torch.cuda_version)
print(torch.backends.cudnn.version())
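Copy and paste these commands into the Python interpreter to check.
The result below suggests the installation succeeded, so now let's train again.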
>>> print(torch.cuda_version)
10.2
>>> print(torch.backends.cudnn.version())
7605
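The same checks can be gathered into one function, so the before and after states are easy to compare. This is a rough sketch: `torch_env_report` is a hypothetical name, and it reads the build's CUDA version from `torch.version.cuda` rather than from the attribute used in the session above. It returns early when torch is not installed.

```python
def torch_env_report():
    """Collect the version facts checked above into one dict."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    report = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "cudnn_available": torch.backends.cudnn.is_available(),
        "cudnn_version": torch.backends.cudnn.version(),
    }
    if report["cuda_available"]:
        # Name of the first visible GPU, e.g. the GTX 1650 in this post.
        report["gpu_name"] = torch.cuda.get_device_name(0)
    return report


print(torch_env_report())
```

Saving this output next to each training run makes it easier to tell later which torch and cuDNN combination produced a given result.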
3. Training results recorded on April 16
Model Summary: 213 layers, 7015519 parameters, 0 gradients, 15.8 GFLOPs
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 29/29 [00:19<00:00,  1.45it/s]
                 all        346        486      0.896      0.874       0.91      0.638
             no-mask        346        107      0.822      0.822      0.848      0.557
                mask        346        379       0.97      0.926      0.972      0.718
Results saved to runs\train\exp35
Training succeeded.
(Aside) Not ready to stop there, I then ran the validation command once more
python val.py --data data/mask_data.yaml --weights runs/train/exp35/weights/best.pt --img 640
The error raised while running validation:
mAP@.5:.95: 100%|██████████| 11/11 [00:06<00:00,  1.83it/s]
OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
As a result, the final metrics were never printed.
The fix is to add the following code to the relevant Python file:
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'
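That makes the error go away.
The underlying cause is that more than one copy of libiomp5md.dll exists in the environment. The cleaner fix is to uninstall numpy completely and then reinstall it.
Use the following commands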
pip uninstall numpy
pip install numpy
I did not try this step.
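Before reinstalling anything, it may help to confirm where the duplicate copies of libiomp5md.dll actually live. The following is a small sketch for that purpose; `find_openmp_copies` is a hypothetical helper, and it simply searches a directory tree for files with that name.

```python
from pathlib import Path


def find_openmp_copies(prefix, name="libiomp5md.dll"):
    """List every file called `name` below the directory `prefix`."""
    return sorted(str(p) for p in Path(prefix).rglob(name))
```

Calling `find_openmp_copies(sys.prefix)` after `import sys` searches the active environment. More than one hit would support the explanation in the error message, and the paths show which packages ship their own copy.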
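Takeaway: learning to get along with bugs is a required course for every programmer. Bugs never end, so let them come even harder.
Special thanks to the following experts
炮哥带你学's blog on CSDN (object detection, a hands-on guide to building your own YOLOv5 object detection platform, object tracking)
肆十二's blog on CSDN (course projects, object detection, personal notes)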