Avoid the pitfalls! Compiling TensorFlow from source to fix CUDA version mismatches: a cure-all method

The most painful part of working with TensorFlow is keeping it matched with the right CUDA version. On top of that, NVIDIA's new Turing-architecture RTX 2080 Ti officially supports only CUDA 10 and above, while TensorFlow versions below 1.13.1 cannot use CUDA 10. Many projects were written against older TensorFlow releases, so anyone who has hit this knows how frustrating it is. This post shows how to compile TensorFlow from source so that a lower TensorFlow version can run on a higher CUDA version; during the build you choose the CUDA and cuDNN versions yourself, which makes this a cure-all for mismatched setups.

This came out of a project I am currently working on. I had just gotten an RTX 2080 Ti and wanted to migrate my code over, only to find it would not run: the 2080 Ti's Turing architecture supports only CUDA 10 and above, while the existing code ran on CUDA 9.0. After four days of experiments it finally works, and the code now runs on the RTX 2080 Ti.
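
Before starting, it is worth confirming what the machine actually has installed. A minimal check, assuming CUDA 10.0 lives under /usr/local/cuda-10.0 and cuDNN's header was copied into its include directory:

$ nvidia-smi

$ /usr/local/cuda-10.0/bin/nvcc --version

$ grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda-10.0/include/cudnn.h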

 

Step 1: Update the system and install related dependencies

sudo apt-get update

Step 2: Install the dependency libraries:

# for Python 2.7

$ sudo apt-get install python-numpy python-dev python-pip python-wheel

# for Python 3.x

$ sudo apt-get install python3-numpy python3-dev python3-pip python3-wheel
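
Depending on which TensorFlow branch you build, the build scripts may also expect a few extra pip packages. The exact set and versions vary by release, so treat this as a starting point rather than a definitive list:

$ pip install -U --user six numpy wheel mock

$ pip install -U --user keras_applications keras_preprocessing --no-deps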

 

Step 3: Download NCCL 2.4.8

Download NCCL v2.4.8 for CUDA 10.0 from https://developer.nvidia.com/nccl
Choose the local installer for Ubuntu 16.04

 

Step 4: Install NCCL 2.4.8

tar -xvf nccl_2.4.8-1+cuda10.0_x86_64.txz
cd nccl_2.4.8-1+cuda10.0_x86_64/
sudo mkdir /usr/local/cuda-10.0/nccl
sudo cp -R * /usr/local/cuda-10.0/nccl
cd /usr/local/cuda-10.0/nccl
sudo mv LICENSE.txt NCCL-SLA.txt
sudo ldconfig

Adjust the version numbers in the commands above to whatever versions you are actually using.
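
To confirm that the files ended up where configure will later look for them, a quick sanity check (paths follow the commands above):

$ ls /usr/local/cuda-10.0/nccl/include /usr/local/cuda-10.0/nccl/lib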

 

Step 5: Install Bazel

Bazel depends on the JDK, so install the JDK first:

$ sudo apt-get install openjdk-8-jdk

Install Bazel via apt:

$ echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" |  sudo tee /etc/apt/sources.list.d/bazel.list

$ curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -

$ sudo apt-get update && sudo apt-get install bazel

Install Bazel via the binary installer:

$ sudo apt-get install pkg-config zip g++ zlib1g-dev unzip python


Download the Bazel installer from https://github.com/bazelbuild/bazel/releases

Note that different TensorFlow versions require matching Bazel versions: an older TensorFlow needs an older Bazel, a newer TensorFlow needs a newer Bazel. With a mismatched Bazel the build will not even start; a quick way to check the expected version is shown below.
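
One way to check the match, assuming the TensorFlow source has already been cloned (Step 6): compare the installed Bazel with the version check inside configure.py (the exact variable and function names differ between TensorFlow releases):

$ bazel version

$ grep -in "bazel" configure.py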

$ chmod +x bazel-<version>-installer-linux-x86_64.sh

$ ./bazel-<version>-installer-linux-x86_64.sh --user

$ gedit ~/.bashrc

Add: export PATH="$PATH:$HOME/bin"

source ~/.bashrc

sudo ldconfig

Step 6: Download the TensorFlow source:

$ git clone https://github.com/tensorflow/tensorflow

$ cd tensorflow*
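
If you need a specific older release, which is usually the whole reason for this exercise, check out its release branch before running configure; r1.12 below is only an example, substitute the branch you actually need:

$ git checkout r1.12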

$ ./configure

Configuration choices:

Specify the Python path when prompted:



Please specify the location of python. [Default is /usr/bin/python]

/usr/bin/python3


Press Enter twice to accept the defaults for the next prompts.



Do you wish to build TensorFlow with Apache Ignite support? [Y/n]: n



Do you wish to build TensorFlow with XLA JIT support? [Y/n]: n



Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n



Do you wish to build TensorFlow with ROCm support? [y/N]: n



Do you wish to build TensorFlow with CUDA support? [y/N]: Y



Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 10.0



Please specify the location where CUDA 10.0 toolkit is installed. Refer to Home for more details. [Default is /usr/local/cuda]: /usr/local/cuda-10.0/



Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 7.6.0



Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]: /usr/local/cuda-10.0/



Do you wish to build TensorFlow with TensorRT support? [y/N]: N



Please specify the NCCL version you want to use. If NCCL 2.2 is not installed, then you can use version 1.3 that can be fetched automatically but it may have worse performance with multiple GPUs. [Default is 2.2]: 2.4.8



Please specify the location where NCCL 2.4.8 is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]: /usr/local/cuda-10.0/nccl/



Now specify the GPU compute capability; for the RTX 2080 Ti it is 7.5.



Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 5.0] 7.5



Do you want to use clang as CUDA compiler? [y/N]: N



Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc



Do you wish to build TensorFlow with MPI support? [y/N]: N



Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: -march=native



Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:N



Configuration finished
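
All of these answers are written to .tf_configure.bazelrc in the source root, so if you mistype something you can either re-run ./configure or edit that file directly:

$ cat .tf_configure.bazelrc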

 

Step 7: Build (see the official guide: https://www.tensorflow.org/install/source)

 GPU Support 

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

This will take a while. Reports online range from 3-4 hours up to 6 hours, but on my machine it took only about an hour and a half.
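
If the build runs out of memory (a common failure on machines with many cores and limited RAM), you can cap Bazel's parallelism with the standard --jobs flag; adjust the number to your machine:

bazel build --config=opt --config=cuda --jobs=4 //tensorflow/tools/pip_package:build_pip_package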

Bazel produces a script called build_pip_package.

Use that script to generate the .whl package:

bazel-bin/tensorflow/tools/pip_package/build_pip_package tensorflow_pkg

This creates a new folder, tensorflow_pkg, containing the .whl installer.
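
You can list the output to see what was produced; the wheel's file name encodes the TensorFlow version and the Python tag:

ls tensorflow_pkg/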

 

Step 8: Install the generated .whl file

cd tensorflow_pkg



sudo pip install *.whl
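
If the build was configured for Python 3, install with pip3 instead:

sudo pip3 install *.whl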

 

Step 9: Verify the installation

python



import tensorflow as tf

hello = tf.constant('Hello, TensorFlow!')

sess = tf.Session()

print(sess.run(hello))

If it prints Hello, TensorFlow! (with a b'' prefix under Python 3), the installation is working.
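
Since the whole point of this build is GPU support, it is also worth confirming that TensorFlow was built with CUDA and can actually see the 2080 Ti; these are standard TF 1.x helpers:

python -c "import tensorflow as tf; print(tf.test.is_built_with_cuda())"

python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"

Both should print True.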

 

 

Supplement:

How to uninstall Bazel (these commands apply to the apt installation method):

$ sudo apt-get --purge remove bazel

$ sudo apt autoremove

For the binary-installer method, I simply deleted the relevant files, removed the environment variable added to ~/.bashrc, and then reinstalled a suitable version.
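
For reference, with the --user binary installer the files usually land under the home directory; the exact layout can differ between Bazel versions, so check before deleting:

rm -rf ~/bin/bazel ~/.bazel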

 

 
