cuda 9.2 / cudnn 7.2 / python3.5 / tensorflow 1.8
root$ source activate python3.5
Command to check the version:
root$ cat /usr/local/cuda/version.txt
Download link:
https://developer.nvidia.com/cuda-92-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal
Downloaded version:
sudo sh cuda_9.2.148_396.37_linux.run
Installation guide:
https://developer.nvidia.com/cuda-92-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal
Verification reference:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#verify-installation
Test commands:
Build:
$ cd ~/NVIDIA_CUDA-9.2_Samples
$ make
Run: cd ~/NVIDIA_CUDA-9.2_Samples ; ./bin/x86_64/linux/release/deviceQuery
Verification result:
Test passed
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 8
Result = PASS
Result:
PASS
Command to check the version:
root# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
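The grep above prints the three version macros from cudnn.h. As a minimal sketch of the same check in Python, the helper below (parse_cudnn_version is a hypothetical name, not part of any NVIDIA tool) extracts CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL from header text. It assumes the macros are plain `#define NAME <integer>` lines, as in the cuDNN 7 header.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) read from cudnn.h text, or None if a macro is missing."""
    fields = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR 7".
        match = re.search(
            r"^\s*#\s*define\s+" + macro + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        fields.append(int(match.group(1)))
    return tuple(fields)


# Illustrative header fragment (assumed layout, not copied from a real install).
sample = """
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 1
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""
print(parse_cudnn_version(sample))
```

On a real machine the text would come from open("/usr/local/cuda/include/cudnn.h").read(), which makes it easy to spot a stale header left over from an earlier cuDNN version.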
Download link:
https://developer.nvidia.com/rdp/cudnn-archive
Installation documentation:
https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html
Downloaded version:
cuDNN v7.1.4 Runtime Library for Ubuntu16.04 (Deb)
cuDNN v7.1.4 Developer Library for Ubuntu16.04 (Deb)
cuDNN v7.1.4 Code Samples and User Guide for Ubuntu16.04 (Deb)
Install commands (the arguments are the download-page titles listed above; substitute the actual .deb file names):
dpkg -i <cuDNN v7.1.4 Runtime Library .deb>
dpkg -i <cuDNN v7.1.4 Developer Library .deb>
dpkg -i <cuDNN v7.1.4 Code Samples and User Guide .deb>
sudo apt update
Environment variables:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
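A quick way to confirm that the two exports above took effect in the current shell is to inspect the variables from Python. This is only a sketch: path_contains is a hypothetical helper, and the directories are the ones used in the exports above.

```python
import os


def path_contains(var_value, directory):
    """True if `directory` is one of the colon-separated entries of `var_value`."""
    wanted = os.path.normpath(directory)
    entries = [entry for entry in var_value.split(":") if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


# Directories taken from the export lines in this post.
checks = (
    ("PATH", "/usr/local/cuda/bin"),
    ("LD_LIBRARY_PATH", "/usr/local/cuda/lib64"),
)
for var, directory in checks:
    present = path_contains(os.environ.get(var, ""), directory)
    print(var, "contains", directory, ":", present)
```

If either line prints False, the exports were probably made in a different shell, or were never added to a startup file such as ~/.bashrc.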
Verification:
cd ~/cudnn_samples_v7/mnistCUDNN
make
./mnistCUDNN
Bus error (core dumped)
Result:
FAILED
cuDNN 7.1 failed, so download 7.2 instead.
Command to check the version:
root# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
Download link:
https://developer.nvidia.com/rdp/cudnn-archive
Installation documentation:
https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html
Downloaded version:
cuDNN v7.2.1 Runtime Library for Ubuntu16.04 (Deb)
cuDNN v7.2.1 Developer Library for Ubuntu16.04 (Deb)
cuDNN v7.2.1 Code Samples and User Guide for Ubuntu16.04 (Deb)
Install commands (in the order given in NVIDIA's install guide: runtime, then dev, then doc):
dpkg -i libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb
dpkg -i libcudnn7-dev_7.2.1.38-1+cuda9.2_amd64.deb
dpkg -i libcudnn7-doc_7.2.1.38-1+cuda9.2_amd64.deb
sudo apt update
Verification:
cd ~/cudnn_samples_v7/mnistCUDNN
make
./mnistCUDNN
Result:
Result of classification: 1 3 5
Test passed!
Copying the files
References:
https://github.com/tensorflow/tensorflow/issues/2626
https://blog.csdn.net/xuezhisdc/article/details/48651003
For cuDNN to be found when TensorFlow is built, the installed files need to be copied to /usr/local/cuda/lib64/
and /usr/local/cuda/include/.
Check:
step1:
First check whether the files under /usr/lib/x86_64-linux-gnu/
are complete:
(python3.*) root@:/# ll /usr/lib/x86_64-linux-gnu/libcudnn*
lrwxrwxrwx 1 root root 29 9月 27 09:29 /usr/lib/x86_64-linux-gnu/libcudnn.so -> /etc/alternatives/libcudnn_so
lrwxrwxrwx 1 root root 17 7月 31 14:54 /usr/lib/x86_64-linux-gnu/libcudnn.so.7 -> libcudnn.so.7.2.1
-rw-r--r-- 1 root root 288585696 7月 31 14:54 /usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
lrwxrwxrwx 1 root root 32 9月 27 09:29 /usr/lib/x86_64-linux-gnu/libcudnn_static.a -> /etc/alternatives/libcudnn_stlib
-rw-r--r-- 1 root root 281810850 7月 31 14:54 /usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a
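step2:
If they are complete, copy them to /usr/local/cuda/lib64
cp /usr/lib/x86_64-linux-gnu/libcudnn* /usr/local/cuda/lib64/
step3:
cudnn.h also needs to be copied into the CUDA path
sudo cp cudnn.h /usr/local/cuda/include/
In theory the 7.2.1 cudnn.h should be downloaded afresh, but I kept the 7.1 cudnn.h and saw no problems during installation.
If problems do arise, it can be downloaded from
(python3.5) root$ chmod +x bazel-0.16.0-installer-linux-x86_64.sh
Download address: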
https://developer.nvidia.com/nccl/nccl-download
Version:
nccl-repo-ubuntu1604-2.2.12-ga-cuda9.2_1-1_amd64.deb
Documentation:
https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html#down
Commands:
$ dpkg -i nccl-repo-ubuntu1604-2.2.12-ga-cuda9.2_1-1_amd64.deb
$ sudo apt update
$ sudo apt install libnccl2 libnccl-dev
Then copy libnccl.so.2 and nccl.h to the corresponding paths.
$ cp /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda-9.2/lib/libnccl.so.2
$ cp /usr/include/nccl.h /usr/local/cuda-9.2/include/
Done.
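Since a missing NCCL file only surfaces later during configure, it may help to confirm that the two copies above landed where expected. The sketch below uses a hypothetical helper, missing_files. The root directory and relative paths are taken from the cp commands above and are not a statement of what TensorFlow's configure script checks.

```python
import os


def missing_files(root, relative_paths):
    """Return the entries of `relative_paths` that do not exist under `root`."""
    return [
        rel for rel in relative_paths
        if not os.path.exists(os.path.join(root, rel))
    ]


# Destinations of the two cp commands in this post.
NCCL_FILES = ("lib/libnccl.so.2", "include/nccl.h")

missing = missing_files("/usr/local/cuda-9.2", NCCL_FILES)
print("missing NCCL files:", missing if missing else "none")
```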
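Documentation:
https://www.tensorflow.org/install/source
Configuration problem 1:
During configure, nccl could not be found.
I reinstalled nccl; see section 5.
Configure:
./configure
Build:
$ bazel build --config=opt --config=cuda tensorflow/tools/pip_package:build_pip_package --action_env="LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"
Result:
The build appears to have succeeded (this target produces the build_pip_package script, not an installed package):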
INFO: From Executing genrule //tensorflow/python/estimator/api:estimator_python_api_gen:
tf.estimator package not installed.
tf.estimator package not installed.
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 668.037s, Critical Path: 240.01s
INFO: 8021 processes: 8021 local.
INFO: Build completed successfully, 10483 total actions
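Error
But running the following command fails:
python -c 'import tensorflow as tf; print(tf.__version__)'
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
It looks like cuBLAS needs to be installed.
Running find / -name 'libcublas*' on the machine shows that every expected libcublas file is already present: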
/root/anaconda3/lib/libcublas.so.9.0
/root/anaconda3/lib/libcublas.so
/root/anaconda3/lib/libcublas.so.9.0.176
/root/anaconda3/pkgs/cudatoolkit-9.0-h13b8566_0/lib/libcublas.so.9.0
/root/anaconda3/pkgs/cudatoolkit-9.0-h13b8566_0/lib/libcublas.so
/root/anaconda3/pkgs/cudatoolkit-9.0-h13b8566_0/lib/libcublas.so.9.0.176
/usr/lib/x86_64-linux-gnu/libcublas_static.a
/usr/lib/x86_64-linux-gnu/libcublas.so.7.5.18
/usr/lib/x86_64-linux-gnu/stubs/libcublas.so
/usr/lib/x86_64-linux-gnu/libcublas_device.a
/usr/lib/x86_64-linux-gnu/libcublas.so
/usr/lib/x86_64-linux-gnu/libcublas.so.7.5
/usr/share/doc/libcublas7.5
/usr/share/lintian/overrides/libcublas7.5
/usr/share/man/man7/libcublas.so.7.gz
/usr/share/man/man7/libcublas.so.7
/usr/share/man/man7/libcublas.7.gz
/usr/share/man/man7/libcublas.7
/usr/local/cuda-9.2/doc/man/man7/libcublas.so.7
/usr/local/cuda-9.2/doc/man/man7/libcublas.7
/usr/local/cuda-9.2/lib64/libcublas_static.a
/usr/local/cuda-9.2/lib64/libcublas.so.9.2
/usr/local/cuda-9.2/lib64/stubs/libcublas.so
/usr/local/cuda-9.2/lib64/libcublas.so.9.2.148
/usr/local/cuda-9.2/lib64/libcublas_device.a
/usr/local/cuda-9.2/lib64/libcublas.so
/usr/local/cuda-9.0/doc/man/man7/libcublas.so.7
/usr/local/cuda-9.0/doc/man/man7/libcublas.7
/usr/local/cuda-9.0/lib64/libcublas_static.a
/usr/local/cuda-9.0/lib64/stubs/libcublas.so
/usr/local/cuda-9.0/lib64/libcublas.so.9.0
/usr/local/cuda-9.0/lib64/libcublas_device.a
/usr/local/cuda-9.0/lib64/libcublas.so
/usr/local/cuda-9.0/lib64/libcublas.so.9.0.176
/var/lib/dpkg/info.bak/libcublas7.5:amd64.shlibs
/var/lib/dpkg/info.bak/libcublas7.5:amd64.symbols
/var/lib/dpkg/info.bak/libcublas7.5:amd64.list
/var/lib/dpkg/info.bak/libcublas7.5:amd64.triggers
/var/lib/dpkg/info.bak/libcublas7.5:amd64.md5sums
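The listing shows that the files exist on disk, but the ImportError is about what the dynamic loader can resolve by name. A minimal sketch of that distinction is to ask the loader directly through ctypes. can_load is a hypothetical helper, and the two sonames are the ones that appear in the error message and in the CUDA 9.2 directory above.

```python
import ctypes


def can_load(soname):
    """True if the dynamic loader can resolve and open `soname` in this process."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        return False
    return True


# libcublas.so.9.0 is the name from the ImportError;
# libcublas.so.9.2 is what the CUDA 9.2 install provides.
for soname in ("libcublas.so.9.0", "libcublas.so.9.2"):
    print(soname, "loadable:", can_load(soname))
```

If libcublas.so.9.2 loads and libcublas.so.9.0 does not, that would suggest the tensorflow being imported was built against CUDA 9.0 rather than the 9.2 toolchain set up here.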
So I wondered whether the environment was failing to load libcublas* correctly.
Following this link, https://github.com/tensorflow/tensorflow/issues/15604
I ran these two commands:
$ bash -c "echo /usr/local/cuda/lib64/ >/etc/ld.so.conf.d/cuda.conf"
$ ldconfig
Then I tried
$ python -c 'import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))'
It still reported that libcublas could not be found.
Then it occurred to me that I was working inside anaconda, and that this env had previously had tensorflow-gpu installed
with pip install tensorflow-gpu. Was the import picking up that earlier tensorflow?
pip list confirmed that a tensorflow package was indeed there, so after source deactivate I ran
$ python -c 'import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))'
and found that it works outside anaconda, but the version is 1.7?
So I activated python3.5 again
and removed the earlier tf with pip uninstall tensorflow,
then reran python -c 'import tensorflow as tf; print(tf.__version__)'
It now works, but the version is still 1.7.
So version 1.7 is now usable. Testing again with
$ python -c 'import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))'
shows that the 8 P100 cards are usable.
2018-09-27 12:36:10.774651: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2018-09-27 12:36:43.384597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:88:00.0
totalMemory: 15.90GiB freeMemory: 360.88MiB
2018-09-27 12:36:43.643579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 1 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:8d:00.0
totalMemory: 15.90GiB freeMemory: 510.88MiB
2018-09-27 12:36:43.903969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 2 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:8e:00.0
totalMemory: 15.90GiB freeMemory: 510.88MiB
2018-09-27 12:36:44.170843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 3 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:8f:00.0
totalMemory: 15.90GiB freeMemory: 510.88MiB
2018-09-27 12:37:02.001743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 4 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:b2:00.0
totalMemory: 15.90GiB freeMemory: 510.88MiB
2018-09-27 12:37:02.286188: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 5 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:b3:00.0
totalMemory: 15.90GiB freeMemory: 510.88MiB
2018-09-27 12:37:02.574992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 6 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:b5:00.0
totalMemory: 15.90GiB freeMemory: 510.88MiB
2018-09-27 12:37:02.578074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6
Thu Sep 27 12:36:42 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:88:00.0 Off | Off |
| N/A 27C P0 30W / 250W | 15637MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000000:8D:00.0 Off | Off |
| N/A 27C P0 29W / 250W | 15487MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 00000000:8E:00.0 Off | Off |
| N/A 29C P0 29W / 250W | 15487MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 00000000:8F:00.0 Off | Off |
| N/A 26C P0 29W / 250W | 15487MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla P100-PCIE... Off | 00000000:B2:00.0 Off | Off |
| N/A 25C P0 29W / 250W | 15487MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla P100-PCIE... Off | 00000000:B3:00.0 Off | Off |
| N/A 29C P0 31W / 250W | 15487MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla P100-PCIE... Off | 00000000:B5:00.0 Off | Off |
| N/A 28C P0 32W / 250W | 15487MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 63644 C python 15625MiB |
| 1 63644 C python 15475MiB |
| 2 63644 C python 15475MiB |
| 3 63644 C python 15475MiB |
| 4 63644 C python 15475MiB |
| 5 63644 C python 15475MiB |
| 6 63644 C python 15475MiB |
+-----------------------------------------------------------------------------+
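To compare the number of GPUs TensorFlow actually registered with the NumDevs value reported by deviceQuery, the "Adding visible gpu devices" log line can be parsed. The sketch below assumes the log format shown in this post. visible_gpu_ids is a hypothetical helper, and the sample text is abbreviated from the log above.

```python
import re


def visible_gpu_ids(log_text):
    """Return the device ids listed on the 'Adding visible gpu devices' log line."""
    # Only digits, commas and spaces are accepted, so the match stops at the line end.
    match = re.search(r"Adding visible gpu devices:\s*([\d, ]+)", log_text)
    if match is None:
        return []
    return [int(token) for token in match.group(1).split(",") if token.strip()]


# Abbreviated from the TensorFlow log in this post.
log_excerpt = (
    "gpu_device.cc:1344] Found device 6 with properties:\n"
    "gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6\n"
    "Thu Sep 27 12:36:42 2018\n"
)
ids = visible_gpu_ids(log_excerpt)
print("visible devices:", ids, "count:", len(ids))
```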
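OK, the installation is basically done.
Now back to the 1.7 version problem. The version reads 1.7 both with anaconda deactivated and with it activated, so I suspect
that a 1.7 tensorflow-gpu was installed with pip by default while anaconda was not active. To verify this, I activated the python3.5 conda env and rebuilt
tensorflow.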
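When several interpreters and pip installs coexist, as here, it can be useful to see which interpreter is running and which file `import tensorflow` would resolve to, without importing it. This is a minimal sketch using only the standard library. module_origin is a hypothetical helper name.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


print("interpreter:", sys.executable)
print("environment prefix:", sys.prefix)
print("tensorflow resolves to:", module_origin("tensorflow"))
```

Running it once with the conda env activated and once after source deactivate should show whether both cases resolve to the same site-packages directory, which is the suspicion raised above.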