CUDA + cuDNN version problems in a TensorFlow runtime environment

Problem

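On a server running CentOS Linux release 7.3.1611 I had previously installed TensorFlow 1.0, CUDA 8.0 and cuDNN v5.1, and TensorFlow programs ran fine. After the machine sat unused for a while a small problem appeared, so I looked into it and fixed it.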

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcudnn.so.5. LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/lib:
I tensorflow/stream_executor/cuda/cuda_dnn.cc:3517] Unable to load cuDNN DSO
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
···
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: Tesla K40m
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:82:00.0
Total memory: 11.17GiB
Free memory: 2.08GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:82:00.0)
F tensorflow/stream_executor/cuda/cuda_dnn.cc:222] Check failed: s.ok() could not find cudnnCreate in cudnn DSO; dlerror: /usr/local/python3/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: cudnnCreate
Aborted (core dumped)

The log says libcudnn.so.5 cannot be found, so I looked in the /usr/local/cuda-8.0/lib64 directory.
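The same directory check can be scripted. The following is a minimal sketch, not part of the original troubleshooting: the helper name `find_libs` is hypothetical, and the search path is the LD_LIBRARY_PATH value copied from the log above.

```python
import glob
import os


def find_libs(ld_library_path, pattern="libcudnn.so*"):
    """Return sorted paths matching `pattern` in each LD_LIBRARY_PATH entry.

    `find_libs` is a hypothetical helper written for this illustration.
    """
    matches = []
    for directory in ld_library_path.split(":"):
        if not directory:  # skip empty entries, e.g. from a trailing ':'
            continue
        # glob returns an empty list for directories that do not exist
        matches.extend(glob.glob(os.path.join(directory, pattern)))
    return sorted(matches)


if __name__ == "__main__":
    # Search path copied from the dso_loader message in the log above.
    search_path = "/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/lib:"
    for path in find_libs(search_path):
        print(path)
```

If nothing is printed, no file whose name starts with libcudnn.so exists in any directory the loader searched, which is the situation described here for libcudnn.so.5.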
[Image 1]
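Sure enough, it is not there. cuDNN had been upgraded at some point, so the system now has versions 6 and 7 installed but no version 5. No matter, a symlink should do: ln -s libcudnn.so.7 libcudnn.so.5
However: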

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
···
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:82:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:390] Loaded runtime CuDNN library: 7004 (compatibility version 7000) but source was compiled with 5105 (compatibility version 5100).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 
Aborted (core dumped)

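Still an error: it says cuDNN 7 is incompatible and cuDNN 5.1 is required. Does that mean the version is too high? Most of the blog posts I found describe the opposite case, a cuDNN version that is too low. I then checked the CUDA website for the cuDNN versions that go with CUDA 8.0.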
[Image 2]
So version 7 does look like the wrong choice. I replaced the link with one pointing at version 6: ln -s libcudnn.so.6 libcudnn.so.5
[Image 3]
After that the program ran successfully.
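For reference, the two integers in the mismatch error above (7004 and 5105) can be decoded into ordinary version numbers. This is a small sketch that assumes the encoding major * 1000 + minor * 100 + patch used in the cuDNN 5 to 7 headers; the function names are hypothetical, and the code does not reproduce TensorFlow's own compatibility check.

```python
def decode_cudnn_version(version):
    """Split an integer such as 7004 into (major, minor, patch).

    Assumes version = major * 1000 + minor * 100 + patch.
    """
    major = version // 1000
    minor = (version % 1000) // 100
    patch = version % 100
    return major, minor, patch


def compat_version(version):
    """Drop the patch level, giving the 'compatibility version' from the log."""
    major, minor, _ = decode_cudnn_version(version)
    return major * 1000 + minor * 100


if __name__ == "__main__":
    # Values taken from the cuda_dnn.cc error line in the log above.
    for label, value in (("loaded at runtime", 7004), ("compiled against", 5105)):
        major, minor, patch = decode_cudnn_version(value)
        print(f"{label}: {major}.{minor}.{patch} (compatibility {compat_version(value)})")
```

Under that assumption the log reads as cuDNN 7.0.4 loaded at runtime against a TensorFlow binary built for cuDNN 5.1.5, and the derived compatibility values 7000 and 5100 agree with the ones printed in the error.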

Summary

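There were two problems: the cuDNN library was missing, and then the cuDNN library version was wrong.
The fix is simple, but when setting up a GPU computing environment it pays to check how the OS version, the GPU's compute capability, the CUDA version and the cuDNN version match up.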
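Because a symlink such as libcudnn.so.5 can hide which version is really being loaded, it may help to audit a library directory for links whose name and target disagree. The sketch below is one possible way to do that; `soname_major` and `mismatched_links` are hypothetical helpers, and the directory is the one used in this post.

```python
import os
import re

# Matches the first numeric component after ".so.", e.g. the 5 in libcudnn.so.5
_SONAME_MAJOR = re.compile(r"\.so\.(\d+)")


def soname_major(filename):
    """Return the major version in a name like libcudnn.so.6.0.21, or None."""
    match = _SONAME_MAJOR.search(os.path.basename(filename))
    return int(match.group(1)) if match else None


def mismatched_links(directory, prefix="libcudnn.so"):
    """List symlinks whose major version differs from the file they resolve to.

    Returns tuples of (link name, resolved file name, link major, target major).
    """
    problems = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not name.startswith(prefix) or not os.path.islink(path):
            continue
        target = os.path.realpath(path)  # follow the whole chain of links
        link_major = soname_major(name)
        target_major = soname_major(target)
        if None not in (link_major, target_major) and link_major != target_major:
            problems.append((name, os.path.basename(target), link_major, target_major))
    return problems


if __name__ == "__main__":
    lib_dir = "/usr/local/cuda-8.0/lib64"  # directory used in this post
    if os.path.isdir(lib_dir):
        for link, target, link_major, target_major in mismatched_links(lib_dir):
            print(f"{link} -> {target}: name says {link_major}, file is {target_major}")
```

On a setup like the one described here, a libcudnn.so.5 link that resolves to a version 6 file would be reported, which serves as a reminder that the workaround is in place.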

