System information:
- Ubuntu 16.04
- NVIDIA GTX 1080 Ti GPUs
- CUDA versions already installed: CUDA 9.0 + cuDNN 7.0.5; CUDA 8.0 + cuDNN 5.1.10
- Current NVIDIA driver: 384.90
- tensorflow-gpu: 2.1.0
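Installation:
I followed this article from 机器之心 (Synced) step by step. The only things that need to be adjusted are the NVIDIA driver version, the CUDA version, and the cuDNN version that matches that CUDA version.
I installed NVIDIA driver 440 + CUDA 10.2 + cuDNN 7.6.5 + tf-gpu 2.1 and ran into the following three problems.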
Pitfall 1 (with this many pits, it feels a bit like visiting the Terracotta Army at the Mausoleum of Qin Shi Huang):
2020-03-06 19:54:56.304782: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
2020-03-06 19:54:56.304857: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
2020-03-06 19:54:56.304869: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
· I ignored the warnings above. They appear only because TensorRT is not installed.
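With several of these warnings scrolling past, it can help to pull out just the library names that failed to load. The following is a minimal sketch, assuming the warning format stays exactly as in the logs shown here; `missing_libraries` and `sample_log` are hypothetical names introduced for illustration, not part of TensorFlow.

```python
import re

# Matches the dso_loader warning format seen in the logs in this post.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")

def missing_libraries(log_text):
    """Return the library names that failed to load, in order, without duplicates."""
    seen = []
    for name in MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen

# Shortened copies of the two TensorRT warnings from pitfall 1.
sample_log = """\
W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
"""

print(missing_libraries(sample_log))
```

Names starting with `libnvinfer` belong to TensorRT and can be ignored if TensorRT is not needed; the other names are the ones worth chasing.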
Pitfall 2
Could not load dynamic library 'libcudart.so.10.1'; dlerror:libcudart.so.10.1 cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
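· This one cost me a lot of time, because until it is fixed, tf.config.list_physical_devices('GPU') returns an empty list, which means the program cannot see any GPU. On Stack Overflow, many people say that TF 2.1 does not support CUDA 10.2 and that you have to switch to CUDA 10.1, the version the warning above asks for. I did not do that, because it would mean reinstalling the GPU driver and CUDA, which is too much trouble. My fix was to create a symlink named libcudart.so.10.1 in /usr/local/cuda/lib64 (this directory contains libcudart.so.10.2) that points to libcudart.so.10.2. Problem solved.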
cd /usr/local/cuda/lib64
ln -s libcudart.so.10.2 libcudart.so.10.1
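The argument order of a symlink is easy to get backwards: the existing file comes first and the new link name second. Below is a small Python sketch of the same aliasing step, run against a throwaway directory with an empty stand-in file rather than a real CUDA install. `link_alias` is a hypothetical helper, and on a real system writing into /usr/local/cuda/lib64 would normally require root. Aliasing the 10.2 runtime under the 10.1 name worked in this setup, but it should not be assumed to work for every library or version pair.

```python
import os
import tempfile

def link_alias(lib_dir, existing, alias):
    """Create lib_dir/alias as a relative symlink pointing at lib_dir/existing."""
    target = os.path.join(lib_dir, existing)
    link_path = os.path.join(lib_dir, alias)
    if not os.path.exists(target):
        raise FileNotFoundError(target)
    if os.path.lexists(link_path):
        return link_path  # leave an existing file or link untouched
    # os.symlink(src, dst): dst is the new link, src is what it points to
    os.symlink(existing, link_path)
    return link_path

# Demo in a temporary directory; the empty file only stands in for the real library.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "libcudart.so.10.2"), "wb").close()
    link = link_alias(demo_dir, "libcudart.so.10.2", "libcudart.so.10.1")
    print(os.path.basename(link), "->", os.readlink(link))
```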
Pitfall 3
Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
· With the experience from pitfall 2, this one was much simpler: I found libcudnn.so.7 in the lib64 directory of cuda-9.0 and copied it into the lib64 directory of cuda-10.2.
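The find-and-copy step can be sketched in Python as follows. This is only an illustration under assumptions: `find_library` and `copy_library` are hypothetical helpers, the `<root>/lib64` layout is taken from the paths in this post, and the demo uses temporary stand-in directories (`cuda-9.0` and `cuda-new`) with an empty file instead of touching /usr/local.

```python
import glob
import os
import shutil
import tempfile

def find_library(name, roots):
    """Return every path matching <root>/lib64/<name>* under the given roots."""
    hits = []
    for root in roots:
        hits.extend(sorted(glob.glob(os.path.join(root, "lib64", name + "*"))))
    return hits

def copy_library(src, dest_dir):
    """Copy src into dest_dir under the same file name and return the new path."""
    dest = os.path.join(dest_dir, os.path.basename(src))
    shutil.copy2(src, dest)  # follows symlinks, so the real file content is copied
    return dest

# Demo with stand-in directories; the empty file only represents the real library.
with tempfile.TemporaryDirectory() as base:
    old_root = os.path.join(base, "cuda-9.0")
    new_root = os.path.join(base, "cuda-new")
    os.makedirs(os.path.join(old_root, "lib64"))
    os.makedirs(os.path.join(new_root, "lib64"))
    open(os.path.join(old_root, "lib64", "libcudnn.so.7"), "wb").close()

    found = find_library("libcudnn.so.7", [old_root, new_root])
    copied = copy_library(found[0], os.path.join(new_root, "lib64"))
    print(os.path.relpath(copied, base))
```

Searching first makes it clear which installed CUDA tree actually ships the file before anything is copied.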
Test
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
Output: perfect.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
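When the list comes back empty instead, a quick way to narrow things down is to ask the dynamic loader directly whether it can open each library name from the warnings, without importing TensorFlow at all. A minimal sketch, assuming a Linux system where `ctypes.CDLL` goes through the same loader search path (including LD_LIBRARY_PATH) that the warnings report; `can_load` is a hypothetical helper name.

```python
import ctypes

def can_load(lib_name):
    """Return True if the dynamic loader can open lib_name, False otherwise."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        return False

# Library names taken from the warnings in pitfalls 1 to 3.
for name in ("libcudart.so.10.1", "libcudnn.so.7", "libnvinfer.so.6"):
    print(name, "OK" if can_load(name) else "MISSING")
```

Any name reported as MISSING here would most likely also show up in a "Could not load dynamic library" warning when TensorFlow starts.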