Getting the GPU build of TensorFlow installed on Deepin Linux these past couple of days has been one long trail of tears, but after countless pitfalls it finally works perfectly. Here are my machine's specs:
The build configuration used throughout this article is as follows:
Note: this article assumes a 64-bit machine (which almost everything is nowadays) and installs Tensorflow-gpu 1.8.0. For other versions, use the officially tested build configurations shown in the figure below; the installation procedure stays the same. As the figure shows, different tf versions really only differ in the versions of the build tools, so keep all the other tool versions the same as mine.
At the end of the article I list the pitfalls I ran into, for reference. On a Deepin 15.9 system with Intel/Nvidia dual graphics, following this tutorial step by step should get you a successful install.
Method overview: use the Bumblebee approach (run the open-source driver day to day and switch to the Nvidia driver and discrete GPU on demand; stable and power-saving), install CUDA from the distribution repositories, install cuDNN and NCCL by hand, and build Tensorflow-gpu from source (the repo only carries CUDA 9.1 while the official tf wheels only support up to 9.0, so building from source is the only option).
I recommend installing Anaconda, a Python distribution that bundles all sorts of libraries handy for research and development; I installed the Python 3.6 build of Anaconda. For how to install Anaconda, see here. Skip this if you already have Python set up.
Use the system's built-in Graphics Driver Manager to switch to the Bumblebee solution with a single click.
Once the switch has taken effect, prefix a command with optirun
to run it on the discrete GPU. Enter the following in a terminal:
optirun -b none nvidia-settings -c :8
This opens the NVIDIA X Server Settings, where you can check the card and driver details; in my case the installed driver is version 390.67. CUDA 9.0 requires driver 384.xx or newer, so that requirement is met.
In my experiments, a manually installed CUDA 9 would not recognize the graphics driver, so here we use the CUDA from the repositories instead. Installation is simple:
sudo apt install nvidia-cuda-dev nvidia-cuda-toolkit libcupti-dev nvidia-nsight nvidia-visual-profiler
One command and it's done, which is certainly convenient. After installing, run
nvcc --version
to confirm that the reported CUDA version is 9.1.
Installing it is easy, but apt scatters the CUDA files all over /usr, while installing cuDNN and building Tensorflow later both need a single CUDA location. So we must gather the CUDA files into one directory with symlinks (this part mainly follows this article):
sudo mkdir -p /usr/local/cuda /usr/local/cuda/extras/CUPTI /usr/local/cuda/nvvm
sudo ln -s /usr/bin /usr/local/cuda/bin
sudo ln -s /usr/include /usr/local/cuda/include
sudo ln -s /usr/lib/x86_64-linux-gnu /usr/local/cuda/lib64
sudo ln -s /usr/local/cuda/lib64 /usr/local/cuda/lib
sudo ln -s /usr/include /usr/local/cuda/extras/CUPTI/include
sudo ln -s /usr/lib/x86_64-linux-gnu /usr/local/cuda/extras/CUPTI/lib64
sudo ln -s /usr/lib/nvidia-cuda-toolkit/libdevice /usr/local/cuda/nvvm/libdevice
With that, the CUDA files are all reachable under /usr/local/cuda.
Tensorflow also depends on cuDNN and NCCL, both Nvidia libraries aimed at machine learning and deep learning.
Download cuDNN here; you will need to register and sign in. I used version 7.1.3, which worked fine in my testing. Be sure to grab the build for CUDA 9.1.
After downloading, extract the archive, cd into the extracted folder in a terminal, and run the following commands to install cuDNN into the CUDA directory:
sudo cp include/* /usr/local/cuda/include/
sudo cp lib64/libcudnn.so.7.1.3 lib64/libcudnn_static.a /usr/local/cuda/lib64/
cd /usr/lib/x86_64-linux-gnu
sudo ln -s libcudnn.so.7.1.3 libcudnn.so.7
sudo ln -s libcudnn.so.7 libcudnn.so
Next install NCCL, downloaded from here. I used NCCL 2.1.15 for CUDA 9.1, i.e. the download labeled "NCCL 2.1.15 O/S agnostic and CUDA 9". As before, extract it, cd into the extracted folder in a terminal, and run:
sudo mkdir -p /usr/local/cuda/nccl/lib /usr/local/cuda/nccl/include
sudo cp *.txt /usr/local/cuda/nccl
sudo cp include/*.h /usr/include/
sudo cp lib/libnccl.so.2.1.15 lib/libnccl_static.a /usr/lib/x86_64-linux-gnu/
sudo ln -s /usr/include/nccl.h /usr/local/cuda/nccl/include/nccl.h
cd /usr/lib/x86_64-linux-gnu
sudo ln -s libnccl.so.2.1.15 libnccl.so.2
sudo ln -s libnccl.so.2 libnccl.so
for i in libnccl*; do sudo ln -s /usr/lib/x86_64-linux-gnu/$i /usr/local/cuda/nccl/lib/$i; done
Install a Java environment and gcc-4.8:
sudo apt install openjdk-8-jdk gcc-4.8
Next comes Bazel, the tool used for the build. First install its dependencies:
sudo apt-get install pkg-config zip g++ zlib1g-dev unzip
Then download Bazel here; make sure to grab the .sh installer for linux x86_64. This article installs Tensorflow 1.8.0 and uses Bazel 0.10.0 for Linux x86_64; for other Tensorflow versions use the Bazel version shown in the first figure of this article. To install:
sudo chmod +x bazel-installer-filename.sh
./bazel-installer-filename.sh --user
Add the Bazel directory to your PATH by opening ~/.bashrc in an editor (no sudo needed, the file is in your own home directory):
dedit ~/.bashrc
and appending the following line at the end of the file:
export PATH="$PATH:$HOME/bin"
You may wonder why we build from source at all. The reason: the repositories only carry CUDA 9.1, while the wheels installed directly by pip support at most CUDA 9.0, so building from source is the only way.
Now let's get started. Download the source of the Tensorflow version you want here. Extract it, cd into the extracted folder in a terminal, and run ./configure
to configure the build. The script asks a long list of questions. Below is my configuration as an example; pay special attention to the comments marked with "@", pick the rest to suit your needs, and accept the defaults for anything you're unsure about:
./configure
You have bazel 0.10.0 installed.
# @ The location of your Python; `which python` will tell you
Please specify the location of python. [Default is /opt/anaconda3/bin/python]: /opt/anaconda3/bin/python
# @ The location of your Python library directories
Found possible Python library paths:
/opt/anaconda3/lib/python3.6/site-packages
Please input the desired Python library path to use. Default is [/opt/anaconda3/lib/python3.6/site-packages]
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]:
jemalloc as malloc support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]:
Google Cloud Platform support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]:
Hadoop File System support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Amazon AWS Platform support? [Y/n]:
Amazon AWS Platform support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]:
Apache Kafka Platform support will be enabled for TensorFlow.
Do you wish to build TensorFlow with XLA JIT support? [y/N]:
No XLA JIT support will be enabled for TensorFlow.
Do you wish to build TensorFlow with GDR support? [y/N]:
No GDR support will be enabled for TensorFlow.
Do you wish to build TensorFlow with VERBS support? [y/N]:
No VERBS support will be enabled for TensorFlow.
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.
# @ Whether to build with CUDA support; this must be Y, or you won't get the gpu build
Do you wish to build TensorFlow with CUDA support? [y/N]: Y
CUDA support will be enabled for TensorFlow.
# @ The CUDA version; enter 9.1 here
Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 9.1
# @ Where CUDA is installed; thanks to the symlinks we set up earlier this is /usr/local/cuda, which is the default
Please specify the location where CUDA 9.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
# @ The cuDNN version; I downloaded 7.1.3
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 7.1.3
# @ Where cuDNN is installed; thanks to the earlier symlinks this is /usr/local/cuda, the default
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.
# @ The NCCL version; I installed 2.1.15
Please specify the NCCL version you want to use. If NCLL 2.2 is not installed, then you can use version 1.3 that can be fetched automatically but it may have worse performance with multiple GPUs. [Default is 1.3]: 2.1.15
# @ Where NCCL is installed; do NOT accept the default here, enter /usr/local/cuda/nccl
Please specify the location where NCCL 2 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:/usr/local/cuda/nccl
# @ The CUDA compute capabilities to build for; choose according to your card and your needs. You can list several, separated by commas.
# You can look up your card's compute capability at https://developer.nvidia.com/cuda-gpus
# Mine is a lousy card with only 3.0 (the minimum this build supports), bear with me -_-!!
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your
build time and binary size. [Default is: 3.5,7.0] 3.0
# @ Keep the default, N
Do you want to use clang as CUDA compiler? [y/N]:
nvcc will be used as CUDA compiler.
# @ The gcc that nvcc should use as host compiler; point it at the gcc 4.8 we installed
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc-4.8
Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.
# @ Best left at the default, which optimizes the build for the CPU of the machine you are compiling on
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
Configuration finished
Once configuration is done, first make sure the Python packages the build needs are installed:
pip install -U --user pip six numpy wheel mock
pip install -U --user keras_applications==1.0.5 --no-deps
pip install -U --user keras_preprocessing==1.0.3 --no-deps
Then kick off the actual Tensorflow-gpu build:
optirun bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
You can also add a --local_resources
flag to cap the system resources the build may use. For example, --local_resources 4096,0.5,1.0
gives the Tensorflow build at most 4096 MB of RAM, 0.5 CPU cores, and all of the available I/O capacity.
The build takes quite a while, several hours, so this is a good time to go grab a meal, some tea, or a movie @-@. Once it finally finishes, package the result into a wheel and install it:
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install --user /tmp/tensorflow_pkg/tensorflow*
With that, Tensorflow-gpu is installed. In a terminal, enter
optirun python
>>>import tensorflow as tf
and see whether Tensorflow now imports successfully. One last thing: whenever you run a Python program that uses Tensorflow-gpu, don't forget to prefix the command with "optirun".
Here are the pitfalls I hit while figuring all this out over the past few days, for your reference: