The most painful part of using TensorFlow is matching it to the right CUDA version. To make things worse, NVIDIA states that the new Turing-architecture RTX 2080 Ti only supports CUDA 10 and above, while TensorFlow versions below 1.13.1 cannot use CUDA 10. Many projects are written against older TensorFlow versions, so anyone who has hit this knows how painful it is. This post walks through building TensorFlow from source so that a lower TensorFlow version can run on a higher CUDA version; during the build you can choose the CUDA and cuDNN versions yourself, which makes this a cure-all for version mismatches.
This came out of a project I am currently working on. I had just gotten an RTX 2080 Ti and wanted to migrate my code to it, only to find that the 2080 Ti's Turing architecture supports only CUDA 10 and above, while the existing code ran on CUDA 9.0. After four days of experiments and head-scratching it finally worked, and the code now runs on the RTX 2080 Ti.
Step 1: Update the system and install the related dependencies
sudo apt-get update
Step 2: Install the dependency libraries:
# for Python 2.7
$ sudo apt-get install python-numpy python-dev python-pip python-wheel
# for Python 3.x
$ sudo apt-get install python3-numpy python3-dev python3-pip python3-wheel
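Depending on the TensorFlow version you are going to build, a few extra Python packages may also be needed; the official TF 1.x source-build guide lists the Keras helper packages, installed without their dependencies (skip this if your version does not ask for them):
$ sudo pip3 install -U keras_applications keras_preprocessing --no-deps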
Step 3: Download NCCL 2.4.8
Download NCCL v2.4.8 for CUDA 10.0 from https://developer.nvidia.com/nccl
Choose the O/S agnostic local installer (the .txz archive), which is what the commands below extract.
Step 4: Install NCCL 2.4.8
tar -xvf nccl_2.4.8-1+cuda10.0_x86_64.txz
cd nccl_2.4.8-1+cuda10.0_x86_64/
sudo mkdir /usr/local/cuda-10.0/nccl
sudo cp -R * /usr/local/cuda-10.0/nccl
cd /usr/local/cuda-10.0/nccl
sudo mv LICENSE.txt NCCL-SLA.txt
sudo ldconfig
Replace the version numbers in the commands above with the versions you actually downloaded.
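As a quick sanity check (the paths simply follow the layout created by the copy above), make sure the NCCL header and libraries are where ./configure will later look for them:
ls /usr/local/cuda-10.0/nccl/include/nccl.h
ls /usr/local/cuda-10.0/nccl/lib/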
Step 5: Install Bazel
Bazel depends on the JDK, so install the JDK first:
$ sudo apt-get install openjdk-8-jdk
Install Bazel (via apt):
$ echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
$ curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
$ sudo apt-get update && sudo apt-get install bazel
Install Bazel (via the binary installer):
$ sudo apt-get install pkg-config zip g++ zlib1g-dev unzip python
Download bazel installer https://github.com/bazelbuild/bazel/releases
Note that different TensorFlow versions require matching Bazel versions: an older TensorFlow needs an older Bazel and a newer TensorFlow needs a newer Bazel. With the wrong Bazel version the build will not even start, so check the tested build configurations table in the official source-build guide (https://www.tensorflow.org/install/source) to find the Bazel version for your TensorFlow version.
$ chmod +x bazel-<version>-installer-linux-x86_64.sh
$ ./bazel-<version>-installer-linux-x86_64.sh --user
$ gedit ~/.bashrc
Add: export PATH="$PATH:$HOME/bin"
source ~/.bashrc
sudo ldconfig
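After reloading the shell, confirm that Bazel is on the PATH and that its version matches what your TensorFlow release expects:
bazel version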
Step 6: Download the TensorFlow source:
$ git clone https://github.com/tensorflow/tensorflow
$ cd tensorflow*
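Since the whole point is building an older TensorFlow against a newer CUDA, check out the release branch for the version your project needs before configuring (r1.12 below is only an example; substitute the branch you actually need):
$ git checkout r1.12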
$ ./configure
Configuration choices:
Give the Python path when prompted:
Please specify the location of python. [Default is /usr/bin/python]
/usr/bin/python3
Press Enter twice to accept the defaults for the next prompts.
Do you wish to build TensorFlow with Apache Ignite support? [Y/n]: n
Do you wish to build TensorFlow with XLA JIT support? [Y/n]: n
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
Do you wish to build TensorFlow with ROCm support? [y/N]: n
Do you wish to build TensorFlow with CUDA support? [y/N]: Y
Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 10.0
Please specify the location where CUDA 10.0 toolkit is installed. Refer to Home for more details. [Default is /usr/local/cuda]: /usr/local/cuda-10.0/
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 7.6.0
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]: /usr/local/cuda-10.0/
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
Please specify the NCCL version you want to use. If NCCL 2.2 is not installed, then you can use version 1.3 that can be fetched automatically but it may have worse performance with multiple GPUs. [Default is 2.2]: 2.4.8
Please specify the location where NCCL 2.4.8 is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]: /usr/local/cuda-10.0/nccl/
Now we need the GPU's compute capability; for the RTX 2080 Ti it is 7.5:
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 5.0] 7.5
Do you want to use clang as CUDA compiler? [y/N]: N
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc
Do you wish to build TensorFlow with MPI support? [y/N]: N
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: -march=native
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:N
Configuration finished
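If you would rather not answer the prompts one by one, the CUDA-related answers can also be supplied through environment variables read by the TF 1.x configure script; the sketch below mirrors the interactive answers above (variable names are taken from configure.py, so adjust the paths and versions to your setup, and fall back to the interactive prompts if your version does not recognize one of them):
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=10.0
export CUDA_TOOLKIT_PATH=/usr/local/cuda-10.0
export TF_CUDNN_VERSION=7
export CUDNN_INSTALL_PATH=/usr/local/cuda-10.0
export TF_NCCL_VERSION=2.4.8
export NCCL_INSTALL_PATH=/usr/local/cuda-10.0/nccl
export TF_CUDA_COMPUTE_CAPABILITIES=7.5
./configure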
Step 7: Build (see the official guide: https://www.tensorflow.org/install/source)
GPU Support
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
This takes quite a while. Some people online report 3-4 hours, others 6, but in my case it took only about an hour and a half.
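If the build exhausts memory or you want to keep the machine usable while it runs, you can cap Bazel's parallelism with the standard --jobs flag (4 is just an example value):
bazel build --jobs=4 --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package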
Bazel produces a script called build_pip_package.
Then use it to generate the .whl install file with the following command:
bazel-bin/tensorflow/tools/pip_package/build_pip_package tensorflow_pkg
This creates a new folder, tensorflow_pkg, containing the .whl install file.
Step 8: Install the generated .whl file
cd tensorflow_pkg
sudo pip install *.whl
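Since the configure step pointed at /usr/bin/python3, the wheel is built for Python 3; on systems where pip still defaults to Python 2, install it with pip3 instead:
sudo pip3 install *.whl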
Step 9: Verify the installation
python
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
If it prints Hello, TensorFlow! (under Python 3 this appears as the bytes string b'Hello, TensorFlow!'), the installation is correct.
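To confirm that this build actually sees the GPU (a quick check using the TF 1.x API), you can also run:
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
It should print True on a working CUDA build.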
Additional notes:
How to uninstall Bazel (these commands apply to the apt installation method):
$ sudo apt-get --purge remove bazel
$ sudo apt autoremove
For the binary-installer method, I simply deleted the relevant files, removed the environment-variable line that was added to ~/.bashrc, and then reinstalled the appropriate version.
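For reference, with the --user binary installer the files usually land under the home directory, so removal looks roughly like this (the paths are assumptions based on the installer's defaults, so double-check them before deleting):
rm -rf ~/bin/bazel ~/.bazel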