Installing TensorFlow 1.8 from Source

TensorFlow source installation

cuda 9.2 / cudnn 7.2 / python3.5 / tensorflow 1.8


1. Anaconda

  • Activate the environment:
    root$ source activate python3.5
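
Later in these notes, a mismatch between the active environment and the one TensorFlow was installed into causes confusion, so it can help to confirm which environment and interpreter are active before building. The following is a minimal sketch, assuming that conda sets the `CONDA_DEFAULT_ENV` variable on activation; the helper name `describe_env` is hypothetical and not part of any tool.

```python
import os
import sys


def describe_env(environ=None, version_info=None):
    """Return (conda_env_name_or_None, 'major.minor') for the interpreter."""
    environ = os.environ if environ is None else environ
    version_info = sys.version_info if version_info is None else version_info
    # Assumption: conda exports CONDA_DEFAULT_ENV while an env is active.
    env_name = environ.get("CONDA_DEFAULT_ENV")
    py_version = "{}.{}".format(version_info[0], version_info[1])
    return env_name, py_version


name, py = describe_env()
print("conda env:", name, "| python:", py)
```

If the first value is `None`, no conda environment is active in the current shell.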

2. CUDA

  • Check the installed version:
    root$ cat /usr/local/cuda/version.txt

  • Download link:
    https://developer.nvidia.com/cuda-92-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal

  • Downloaded version (runfile), installed with:
    sudo sh cuda_9.2.148_396.37_linux.run

  • Installation guide:
    https://developer.nvidia.com/cuda-92-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal

  • Verification reference:
    https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#verify-installation

  • Test commands:
    Build:
    $ cd ~/NVIDIA_CUDA-9.2_Samples
    $ make
    Test:
    $ cd ~/NVIDIA_CUDA-9.2_Samples
    $ ./bin/x86_64/linux/release/deviceQuery

  • Verification output:
    The test passed.

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 8
    Result = PASS
    
  • Result:
    PASS
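
The summary line printed by deviceQuery carries the driver version, the runtime version, and the device count, which are the values worth recording before moving on. As an illustration only, the sketch below extracts them from captured output; the function name `parse_device_query` is hypothetical, and the parsing assumes the output format shown above.

```python
import re


def parse_device_query(output):
    """Extract versions, device count and pass/fail from deviceQuery output."""
    summary = {"driver": None, "runtime": None, "num_devs": None}
    m = re.search(r"CUDA Driver Version = ([\d.]+)", output)
    if m:
        summary["driver"] = m.group(1)
    m = re.search(r"CUDA Runtime Version = ([\d.]+)", output)
    if m:
        summary["runtime"] = m.group(1)
    m = re.search(r"NumDevs = (\d+)", output)
    if m:
        summary["num_devs"] = int(m.group(1))
    # The sample prints a final "Result = PASS" line on success.
    summary["passed"] = re.search(r"^\s*Result = PASS\s*$", output, re.M) is not None
    return summary


sample = (
    "deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, "
    "CUDA Runtime Version = 9.2, NumDevs = 8\n"
    "Result = PASS\n"
)
print(parse_device_query(sample))
```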

3. cuDNN

3.1 cuDNN 7.1 installation [failed]

  • Check the installed version:
    root# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

  • Download link:
    https://developer.nvidia.com/rdp/cudnn-archive

  • Installation guide:
    https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html

  • Downloaded packages:

    cuDNN v7.1.4 Runtime Library for Ubuntu16.04 (Deb)
    cuDNN v7.1.4 Developer Library for Ubuntu16.04 (Deb)
    cuDNN v7.1.4 Code Samples and User Guide for Ubuntu16.04 (Deb)
    
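  • Install commands (the names in angle brackets are the download-page titles; pass the corresponding .deb file names to dpkg):

    dpkg -i <cuDNN v7.1.4 Runtime Library for Ubuntu16.04 (Deb)>
    dpkg -i <cuDNN v7.1.4 Developer Library for Ubuntu16.04 (Deb)>
    dpkg -i <cuDNN v7.1.4 Code Samples and User Guide for Ubuntu16.04 (Deb)>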
    

    sudo apt update

  • Environment variables:

    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
    
  • Verification:
    cd ~/cudnn_samples_v7/mnistCUDNN
    make
    ./mnistCUDNN
    Bus error (core dumped)

  • Result:
    FAILED
    cuDNN 7.1 failed, so I downloaded 7.2 instead.

3.2 cuDNN 7.2 installation

  • Check the installed version:
    root# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

  • Download link:
    https://developer.nvidia.com/rdp/cudnn-archive

  • Installation guide:
    https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html

  • Downloaded packages:

    cuDNN v7.2.1 Runtime Library for Ubuntu16.04 (Deb)
    cuDNN v7.2.1 Developer Library for Ubuntu16.04 (Deb)
    cuDNN v7.2.1 Code Samples and User Guide for Ubuntu16.04 (Deb)
    
  • Install commands:

    dpkg -i libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb
    dpkg -i libcudnn7-doc_7.2.1.38-1+cuda9.2_amd64.deb
    dpkg -i libcudnn7-dev_7.2.1.38-1+cuda9.2_amd64.deb
    

    sudo apt update

  • Verification:

    • Commands
      cd ~/cudnn_samples_v7/mnistCUDNN
      make
      ./mnistCUDNN
  • Result

    Result of classification: 1 3 5
    Test passed!
    
    • Copying the files
      References:
      https://github.com/tensorflow/tensorflow/issues/2626
      https://blog.csdn.net/xuezhisdc/article/details/48651003

      For the TensorFlow build to find cuDNN, the installed files must be copied to /usr/local/cuda/lib64/
      and /usr/local/cuda/include/.

      • Check:
        step1:
        First check that the files under /usr/lib/x86_64-linux-gnu/ are complete.

        (python3.*) root@:/# ll /usr/lib/x86_64-linux-gnu/libcudnn*
        lrwxrwxrwx 1 root root        29 9月  27 09:29 /usr/lib/x86_64-linux-gnu/libcudnn.so -> /etc/alternatives/libcudnn_so
        lrwxrwxrwx 1 root root        17 7月  31 14:54 /usr/lib/x86_64-linux-gnu/libcudnn.so.7 -> libcudnn.so.7.2.1
        -rw-r--r-- 1 root root 288585696 7月  31 14:54 /usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
        lrwxrwxrwx 1 root root        32 9月  27 09:29 /usr/lib/x86_64-linux-gnu/libcudnn_static.a -> /etc/alternatives/libcudnn_stlib
        -rw-r--r-- 1 root root 281810850 7月  31 14:54 /usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a
        

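        step2:
        If they are complete, copy them to /usr/local/cuda/lib64:
        cp /usr/lib/x86_64-linux-gnu/libcudnn* /usr/local/cuda/lib64/

        step3:
        cudnn.h must also be copied into the CUDA include path:
        sudo cp cudnn.h /usr/local/cuda/include/
        In principle the cudnn.h from 7.2.1 should be used, but I kept the 7.1 cudnn.h and saw no problems with the installation.
        If problems do appear, the header can be downloaded from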
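
Since the notes keep a 7.1 header next to a 7.2.1 library, it can be useful to compare the two versions explicitly. The sketch below is only an illustration: it assumes the header defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL` (the macros the grep command above looks at) and that the library file name ends in its full version, as in the listing above. Both function names are hypothetical.

```python
import re


def cudnn_header_version(header_text):
    """Return (major, minor, patch) parsed from the text of cudnn.h, or None."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        m = re.search(r"^\s*#define\s+" + macro + r"\s+(\d+)", header_text, re.M)
        if m is None:
            return None
        parts.append(int(m.group(1)))
    return tuple(parts)


def cudnn_library_version(path):
    """Return (major, minor, patch) from a name like libcudnn.so.7.2.1, or None."""
    m = re.search(r"libcudnn\.so\.(\d+)\.(\d+)\.(\d+)$", path)
    return tuple(int(g) for g in m.groups()) if m else None


# Illustrative header text standing in for a 7.1.4 cudnn.h.
header = "#define CUDNN_MAJOR 7\n#define CUDNN_MINOR 1\n#define CUDNN_PATCHLEVEL 4\n"
library = "/usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1"
print("header:", cudnn_header_version(header))
print("library:", cudnn_library_version(library))
print("match:", cudnn_header_version(header) == cudnn_library_version(library))
```

To check a real installation, read the header with `open("/usr/local/cuda/include/cudnn.h").read()` and pass the text to `cudnn_header_version`.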

4. bazel install

  • Path: /home/royal/thirdparty/
  • File: /home/royal/thirdparty/bazel-0.16.0-installer-linux-x86_64.sh
  • Commands:

(python3.5) root$ chmod +x bazel-0.16.0-installer-linux-x86_64.sh
(python3.5) root$ ./bazel-0.16.0-installer-linux-x86_64.sh

5. NCCL 2

  • Download page:
    https://developer.nvidia.com/nccl/nccl-download

  • Version:
    nccl-repo-ubuntu1604-2.2.12-ga-cuda9.2_1-1_amd64.deb

  • Documentation:
    https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html#down

  • Commands:

    $ dpkg -i nccl-repo-ubuntu1604-2.2.12-ga-cuda9.2_1-1_amd64.deb
    $ sudo apt update
    $ sudo apt install libnccl2 libnccl-dev
    

    Then copy libnccl.so.2 and nccl.h to the corresponding paths.

    $ cp /usr/lib/x86_64-linux-gnu/libnccl.so.2 /usr/local/cuda-9.2/lib/libnccl.so.2
    $ cp /usr/include/nccl.h /usr/local/cuda-9.2/include/
    
  • Done.
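
Before running configure, one can confirm that the two copied files sit where the copy commands above put them. A small sketch, assuming the layout used in these notes (`lib/libnccl.so.2` and `include/nccl.h` under one root); the helper name `missing_nccl_files` is hypothetical.

```python
import os


def missing_nccl_files(install_root):
    """Return the expected NCCL files that are absent under install_root."""
    expected = [
        os.path.join(install_root, "lib", "libnccl.so.2"),
        os.path.join(install_root, "include", "nccl.h"),
    ]
    return [path for path in expected if not os.path.exists(path)]


# The root used by the cp commands in these notes.
for path in missing_nccl_files("/usr/local/cuda-9.2"):
    print("missing:", path)
```

An empty result means both files are in place under the given root.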

6. TensorFlow

  • Documentation:
    https://www.tensorflow.org/install/source

  • Configuration problem 1:
    During configure, nccl could not be found.
    I reinstalled nccl; see section 5.

  • Configure:
    ./configure

  • Build
    $ bazel build --config=opt --config=cuda tensorflow/tools/pip_package:build_pip_package --action_env="LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"

  • Result
    The build appears to have succeeded:

    INFO: From Executing genrule //tensorflow/python/estimator/api:estimator_python_api_gen:
    tf.estimator package not installed.
    tf.estimator package not installed.
    Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
      bazel-bin/tensorflow/tools/pip_package/build_pip_package
    INFO: Elapsed time: 668.037s, Critical Path: 240.01s
    INFO: 8021 processes: 8021 local.
    INFO: Build completed successfully, 10483 total actions
    
  • Error
    However, running the following command fails:
    python -c 'import tensorflow as tf; print(tf.__version__)'

    ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
    

    It looks like cublas needs to be installed.
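
The ImportError names a specific shared object, so a quick way to narrow the problem down is to ask the dynamic loader directly whether it can open that file, without importing TensorFlow. A minimal sketch using the standard `ctypes` module; the helper name `can_load` is hypothetical.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open library_name, else False."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False
    return True


for name in ("libcublas.so.9.0", "libcublas.so.9.2"):
    print(name, "loadable:", can_load(name))
```

If `libcublas.so.9.2` loads but `libcublas.so.9.0` does not, the library itself is installed and the TensorFlow being imported was built against a different CUDA version.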

libcublas

Running find / -name libcublas* on the machine shows that every libcublas file one would expect is present:

/root/anaconda3/lib/libcublas.so.9.0
/root/anaconda3/lib/libcublas.so
/root/anaconda3/lib/libcublas.so.9.0.176
/root/anaconda3/pkgs/cudatoolkit-9.0-h13b8566_0/lib/libcublas.so.9.0
/root/anaconda3/pkgs/cudatoolkit-9.0-h13b8566_0/lib/libcublas.so
/root/anaconda3/pkgs/cudatoolkit-9.0-h13b8566_0/lib/libcublas.so.9.0.176
/usr/lib/x86_64-linux-gnu/libcublas_static.a
/usr/lib/x86_64-linux-gnu/libcublas.so.7.5.18
/usr/lib/x86_64-linux-gnu/stubs/libcublas.so
/usr/lib/x86_64-linux-gnu/libcublas_device.a
/usr/lib/x86_64-linux-gnu/libcublas.so
/usr/lib/x86_64-linux-gnu/libcublas.so.7.5
/usr/share/doc/libcublas7.5
/usr/share/lintian/overrides/libcublas7.5
/usr/share/man/man7/libcublas.so.7.gz
/usr/share/man/man7/libcublas.so.7
/usr/share/man/man7/libcublas.7.gz
/usr/share/man/man7/libcublas.7
/usr/local/cuda-9.2/doc/man/man7/libcublas.so.7
/usr/local/cuda-9.2/doc/man/man7/libcublas.7
/usr/local/cuda-9.2/lib64/libcublas_static.a
/usr/local/cuda-9.2/lib64/libcublas.so.9.2
/usr/local/cuda-9.2/lib64/stubs/libcublas.so
/usr/local/cuda-9.2/lib64/libcublas.so.9.2.148
/usr/local/cuda-9.2/lib64/libcublas_device.a
/usr/local/cuda-9.2/lib64/libcublas.so
/usr/local/cuda-9.0/doc/man/man7/libcublas.so.7
/usr/local/cuda-9.0/doc/man/man7/libcublas.7
/usr/local/cuda-9.0/lib64/libcublas_static.a
/usr/local/cuda-9.0/lib64/stubs/libcublas.so
/usr/local/cuda-9.0/lib64/libcublas.so.9.0
/usr/local/cuda-9.0/lib64/libcublas_device.a
/usr/local/cuda-9.0/lib64/libcublas.so
/usr/local/cuda-9.0/lib64/libcublas.so.9.0.176
/var/lib/dpkg/info.bak/libcublas7.5:amd64.shlibs
/var/lib/dpkg/info.bak/libcublas7.5:amd64.symbols
/var/lib/dpkg/info.bak/libcublas7.5:amd64.list
/var/lib/dpkg/info.bak/libcublas7.5:amd64.triggers
/var/lib/dpkg/info.bak/libcublas7.5:amd64.md5sums
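
The listing is long, and what matters is which directory provides which libcublas version. The sketch below groups paths like the ones above by directory and `major.minor` version; it is an illustration only, and the function name `versions_by_prefix` is hypothetical.

```python
import re
from collections import defaultdict


def versions_by_prefix(paths, library="libcublas"):
    """Map each directory to the set of major.minor versions found there."""
    # Match only names of the form <dir>/<library>.so.<major>.<minor>
    pattern = re.compile(r"^(.*)/" + re.escape(library) + r"\.so\.(\d+\.\d+)$")
    found = defaultdict(set)
    for path in paths:
        m = pattern.match(path.strip())
        if m:
            found[m.group(1)].add(m.group(2))
    return dict(found)


# A few paths taken from the find output above.
paths = [
    "/root/anaconda3/lib/libcublas.so.9.0",
    "/root/anaconda3/lib/libcublas.so.9.0.176",
    "/usr/lib/x86_64-linux-gnu/libcublas.so.7.5",
    "/usr/local/cuda-9.2/lib64/libcublas.so.9.2",
    "/usr/local/cuda-9.0/lib64/libcublas.so.9.0",
    "/usr/local/cuda-9.2/doc/man/man7/libcublas.so.7",
]
for prefix, versions in sorted(versions_by_prefix(paths).items()):
    print(prefix, sorted(versions))
```

Grouped this way, the listing shows `libcublas.so.9.0` under anaconda and /usr/local/cuda-9.0, while /usr/local/cuda-9.2 provides only 9.2.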

So I wondered whether the environment was failing to load libcublas* correctly.
Following https://github.com/tensorflow/tensorflow/issues/15604,
I ran these two commands:
$ bash -c "echo /usr/local/cuda/lib64/ >/etc/ld.so.conf.d/cuda.conf"
$ ldconfig
Then I tried
$ python -c "import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))"
It still reported that libcublas could not be found.
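
After running ldconfig, the loader cache can be inspected to see which libcublas versions it actually resolves. The sketch below parses text in the format printed by `ldconfig -p`; the sample text is illustrative rather than captured from this machine, and both function names are hypothetical.

```python
import subprocess


def parse_ldconfig_cache(text):
    """Map library names to resolved paths from `ldconfig -p` style output."""
    cache = {}
    for line in text.splitlines():
        if "=>" not in line:
            continue  # skip the header line
        left, path = line.split("=>", 1)
        name = left.split("(", 1)[0].strip()
        # Keep the first entry listed for each name.
        cache.setdefault(name, path.strip())
    return cache


def read_ldconfig_cache():
    """Run `ldconfig -p` and parse its output (requires ldconfig on PATH)."""
    output = subprocess.check_output(["ldconfig", "-p"])
    return parse_ldconfig_cache(output.decode())


# Illustrative sample, not real output from the machine in these notes.
sample = (
    "2 libs found in cache `/etc/ld.so.cache'\n"
    "\tlibcublas.so.9.2 (libc6,x86-64) => /usr/local/cuda/lib64/libcublas.so.9.2\n"
    "\tlibc.so.6 (libc6,x86-64) => /lib/x86_64-linux-gnu/libc.so.6\n"
)
cache = parse_ldconfig_cache(sample)
print("libcublas.so.9.0 in cache:", "libcublas.so.9.0" in cache)
```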

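Then it occurred to me: I was working inside anaconda, and this env had previously had tensorflow-gpu installed with pip install. Could the import be picking up that earlier tensorflow?
pip list confirmed that a tensorflow package was indeed present. After source deactivate,
I ran $ python -c "import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))",
and found that outside anaconda it worked, but the version was 1.7?

So I activated python3.5 again, removed the earlier TF with pip uninstall tensorflow, and reran python -c 'import tensorflow as tf; print(tf.__version__)'. The import now worked, but the version was still 1.7.
That meant version 1.7 was now usable. Testing again with $ python -c 'import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))' showed that the 8 P100 cards were available.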
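
When several TensorFlow installs may coexist (system pip, a conda env, a fresh build), it helps to see which file the interpreter would import from, without triggering the import and its shared-library loading. A minimal sketch using the standard `importlib.util.find_spec`; the helper name `module_origin` is hypothetical.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


print("tensorflow would be imported from:", module_origin("tensorflow"))
```

Running this once with the conda environment active and once after deactivating it shows whether the two shells resolve to different installs.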

2018-09-27 12:36:10.774651: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2018-09-27 12:36:43.384597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:88:00.0
totalMemory: 15.90GiB freeMemory: 360.88MiB
2018-09-27 12:36:43.643579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 1 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:8d:00.0
totalMemory: 15.90GiB freeMemory: 510.88MiB
2018-09-27 12:36:43.903969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 2 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:8e:00.0
totalMemory: 15.90GiB freeMemory: 510.88MiB
2018-09-27 12:36:44.170843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 3 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:8f:00.0
totalMemory: 15.90GiB freeMemory: 510.88MiB
2018-09-27 12:37:02.001743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 4 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:b2:00.0
totalMemory: 15.90GiB freeMemory: 510.88MiB
2018-09-27 12:37:02.286188: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 5 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:b3:00.0
totalMemory: 15.90GiB freeMemory: 510.88MiB
2018-09-27 12:37:02.574992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 6 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:b5:00.0
totalMemory: 15.90GiB freeMemory: 510.88MiB
2018-09-27 12:37:02.578074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6
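
The excerpt above lists devices 0 through 6. To compare what TensorFlow found against the device count reported by deviceQuery, the startup log can be parsed. This is a sketch that assumes the TensorFlow 1.x log format shown above; the function name `found_devices` is hypothetical.

```python
import re


def found_devices(log_text):
    """Return {device_index: device_name} parsed from TF 1.x startup log text."""
    devices = {}
    current = None
    for line in log_text.splitlines():
        m = re.search(r"Found device (\d+) with properties", line)
        if m:
            current = int(m.group(1))
            continue
        # The device name is on the line that follows "Found device N".
        m = re.match(r"name: (.+?) major: \d+", line)
        if m and current is not None:
            devices[current] = m.group(1)
            current = None
    return devices


log = (
    "2018-09-27 12:36:43.384597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:\n"
    "name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285\n"
    "pciBusID: 0000:88:00.0\n"
    "2018-09-27 12:36:43.643579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 1 with properties:\n"
    "name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285\n"
    "pciBusID: 0000:8d:00.0\n"
)
devices = found_devices(log)
print(len(devices), "devices:", devices)
```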

Thu Sep 27 12:36:42 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:88:00.0 Off |                  Off |
| N/A   27C    P0    30W / 250W |  15637MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:8D:00.0 Off |                  Off |
| N/A   27C    P0    29W / 250W |  15487MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:8E:00.0 Off |                  Off |
| N/A   29C    P0    29W / 250W |  15487MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:8F:00.0 Off |                  Off |
| N/A   26C    P0    29W / 250W |  15487MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-PCIE...  Off  | 00000000:B2:00.0 Off |                  Off |
| N/A   25C    P0    29W / 250W |  15487MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-PCIE...  Off  | 00000000:B3:00.0 Off |                  Off |
| N/A   29C    P0    31W / 250W |  15487MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-PCIE...  Off  | 00000000:B5:00.0 Off |                  Off |
| N/A   28C    P0    32W / 250W |  15487MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     63644      C   python                                     15625MiB |
|    1     63644      C   python                                     15475MiB |
|    2     63644      C   python                                     15475MiB |
|    3     63644      C   python                                     15475MiB |
|    4     63644      C   python                                     15475MiB |
|    5     63644      C   python                                     15475MiB |
|    6     63644      C   python                                     15475MiB |
+-----------------------------------------------------------------------------+

OK, the installation is basically done.

tensorflow 1.8

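Now back to the 1.7 version problem. The reported version is 1.7 both with anaconda deactivated and with it activated, so I suspect that
tensorflow-gpu 1.7 was installed with pip by default while no environment was active. To verify this, I activated the python3.5 conda environment and rebuilt
tensorflow.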
