Original article: https://blog.hidavid.cn/cuda-cudnn-install-successful/
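I recently picked YOLOv3 back up for some practice, running detection on medical B-mode ultrasound images.
Rebuilding the environment was painful because the network speed kept fluctuating, but I got it done in the end.
Below is how to check whether CUDA is working properly.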
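First, check whether CUDA is installed correctly.
The usual install path is /usr/local/cuda
Run nvcc -V
(capital V, or the long form nvcc --version) to print the CUDA version. Lowercase -v means verbose mode and fails with a "No input files specified" error when no source file is given.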
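The same check can be scripted. This is only a minimal sketch: it assumes nvcc is on PATH (otherwise pass the full path, for example /usr/local/cuda/bin/nvcc), and it assumes the output contains the usual "release X.Y" line. The function names parse_nvcc_release and cuda_toolkit_version are my own, not part of any CUDA tooling.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Return the 'X.Y' release string from `nvcc --version` output, or None."""
    # Assumption: the output contains a line such as
    # "Cuda compilation tools, release 10.0, V10.0.130".
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


def cuda_toolkit_version(nvcc="nvcc"):
    """Run nvcc and return its release string, or None if nvcc is not found."""
    if shutil.which(nvcc) is None:
        return None  # not on PATH; try passing /usr/local/cuda/bin/nvcc
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("CUDA toolkit release:", cuda_toolkit_version())
```

A result of None means either nvcc was not found or its output did not match the assumed format, so it is worth running the command by hand in that case.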
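Next, check cuDNN. Installing this library is simple: you only need to copy the files from cuDNN's include and lib64 directories into the corresponding CUDA directories. So to check whether it is installed, go to CUDA's include and lib64 directories and run ls | grep cudnn
to see whether any cuDNN files are there.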
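Here is a rough Python equivalent of that ls | grep cudnn check, in case you want to run it from a setup script. It assumes the default /usr/local/cuda root mentioned above; find_cudnn_files is a name I made up for this sketch.

```python
from pathlib import Path


def find_cudnn_files(cuda_root="/usr/local/cuda"):
    """Map 'include' and 'lib64' to the cuDNN file names found in each."""
    found = {}
    for sub in ("include", "lib64"):
        directory = Path(cuda_root) / sub
        if directory.is_dir():
            # Same idea as `ls | grep cudnn` inside this directory.
            found[sub] = sorted(p.name for p in directory.glob("*cudnn*"))
        else:
            found[sub] = []  # directory missing, so nothing to list
    return found


if __name__ == "__main__":
    for sub, names in find_cudnn_files().items():
        print(sub, "->", names or "no cuDNN files found")
```

Finding the files only shows they were copied; whether they load correctly is a separate question.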
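There are many ways to check that the libraries are actually usable; I'll use TensorFlow as an example.
When TensorFlow starts, it prints a log like the one below, which shows that the CUDA and cuDNN libraries were all loaded successfully.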
2020-10-28 13:23:32.729663: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-10-28 13:23:32.732189: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-10-28 13:23:32.734612: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-10-28 13:23:32.735064: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-10-28 13:23:32.737413: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-10-28 13:23:32.739131: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-10-28 13:23:32.744086: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-10-28 13:23:32.754026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2020-10-28 13:23:32.754078: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-10-28 13:23:32.759785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-28 13:23:32.759804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1
2020-10-28 13:23:32.759842: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y
2020-10-28 13:23:32.759868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N
2020-10-28 13:23:32.767512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30265 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8b:00.0, compute capability: 7.0)
2020-10-28 13:23:32.770834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30265 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8d:00.0, compute capability: 7.0)
2020-10-28 13:24:08.051764: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
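If the log is long, scanning it by eye gets tedious. The sketch below pulls the library names out of the "Successfully opened dynamic library" lines and reports which expected ones never showed up. The helper names and the default list of required libraries are my own choices, and the pattern only assumes the message wording seen in the log above.

```python
import re

# Matches the dso_loader message shown in the log above.
_OPENED = re.compile(r"Successfully opened dynamic library (\S+)")


def opened_libraries(log_text):
    """Return the set of library file names TensorFlow reported as opened."""
    return set(_OPENED.findall(log_text))


def missing_libraries(log_text, required=("libcudart", "libcublas", "libcudnn")):
    """Return the required library prefixes that never appear as opened."""
    opened = opened_libraries(log_text)
    return [name for name in required
            if not any(lib.startswith(name) for lib in opened)]


if __name__ == "__main__":
    import sys
    # Read a saved log from stdin and list what is missing.
    print("missing:", missing_libraries(sys.stdin.read()))
```

One way to use it, assuming the block is saved as check_log.py and the training script is called train.py (both hypothetical names): run python train.py 2> tf.log to capture the log, then python check_log.py < tf.log. An empty list means all three library families were opened.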
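During training, you can also judge by watching CPU and GPU usage.
For example, run top
to view CPU and memory usage in real time. Here the CPU usage is fairly high; it goes above 100% because top adds up usage across cores. If I remember correctly, CPU usage was close to 800% when I trained without the GPU.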
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5374 root 20 0 65.426g 6.152g 1.499g S 226.6 1.2 27:42.84 python
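Then run nvidia-smi
to check GPU usage. In the table below, GPU 0 shows 76% in the GPU-Util column (compute utilization) and 31361MiB of 32480MiB in the Memory-Usage column (about 97%), which means the GPU is being used. Note that TensorFlow reserves most of the GPU memory by default, so a nearly full Memory-Usage column is not a problem in itself. When the model really needs more GPU memory than is available, TensorFlow crashes with an out-of-memory error, and the usual fix is to reduce the batch size.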
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8B:00.0 Off |                    0 |
| N/A   58C    P0   160W / 300W |  31361MiB / 32480MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:8D:00.0 Off |                    0 |
| N/A   45C    P0    59W / 300W |    625MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
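To read these numbers from a script instead of the table, nvidia-smi has a CSV query mode. The sketch below uses it to report compute utilization and memory use separately for each GPU. The function names are mine, and the parser assumes every field comes back as a plain integer (a GPU that reports a field as unsupported would need extra handling).

```python
import shutil
import subprocess

QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse output of nvidia-smi --query-gpu=... --format=csv,noheader,nounits."""
    gpus = []
    for line in text.strip().splitlines():
        index, util, used, total = (field.strip() for field in line.split(","))
        gpus.append({
            "index": int(index),
            "util_percent": int(util),    # the GPU-Util column
            "mem_used_mib": int(used),    # left side of Memory-Usage
            "mem_total_mib": int(total),  # right side of Memory-Usage
            "mem_percent": round(100 * int(used) / int(total), 1),
        })
    return gpus


def query_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is not found."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    ).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

For the two GPUs in the table above, the CSV lines would be "0, 76, 31361, 32480" and "1, 0, 625, 32480", which gives 96.6% and 1.9% memory use respectively.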