Ubuntu 16.04 + CUDA Toolkit 10.1 + cuDNN 7.6 + Bazel 0.26.1
Just make sure the cuDNN version matches the CUDA SDK version.
When building from source you need to know your GPU's compute capability; look it up (https://developer.nvidia.com/cuda-gpus lists it per GPU) and enter it during configuration. It is independent of the CUDA SDK version.
▲ You must register an NVIDIA account before you can download cuDNN.
▲ Keep the points above in mind, and read the pitfall notes at the end of this article first so you know which problems to expect.
▲ Keep a working connection to the external internet (i.e. a proxy) for the whole process.
Install the NVIDIA driver
$ ubuntu-drivers devices   # list the GPU and the available driver packages
Check the installed NVIDIA driver version:
$ sudo dpkg --list | grep nvidia-*
Correspondence between GPU driver and CUDA versions
Check NVIDIA's official site for the latest table.
Download the NVIDIA driver
▲ Note: the driver and CUDA versions must match!
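A sketch of one common way to install a matching driver (the concrete package name below, nvidia-418, is only a placeholder; pick whatever ubuntu-drivers lists for your GPU and the CUDA version you need):
$ sudo ubuntu-drivers autoinstall   # installs the recommended driver
# or install a specific version shown by "ubuntu-drivers devices":
$ sudo apt install nvidia-418       # placeholder package name; match it to your CUDA release
$ sudo reboot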
After the driver is installed, you can check GPU usage with nvidia-smi.
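For a quick look, or a continuously refreshing view:
$ nvidia-smi
$ watch -n 1 nvidia-smi   # refresh every second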
CUDA Toolkit 10.1
Before installing, check the GPU driver version first.
# Installation steps
$ sudo apt update
$ sudo apt install cuda
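Note that sudo apt install cuda only works once NVIDIA's CUDA apt repository has been added. A rough, placeholder sketch of the deb (network) route; take the exact repository package and key commands from the CUDA Toolkit download page:
$ sudo dpkg -i cuda-repo-ubuntu1604_<version>_amd64.deb   # placeholder file name from the download page
$ sudo apt-key adv --fetch-keys <key URL shown on the download page>
$ sudo apt update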
Set the environment variables
$ vim ~/.bashrc
Append the CUDA bin directory to PATH and the CUDA lib64 directory to LD_LIBRARY_PATH.
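A minimal sketch of the two lines to append, assuming CUDA 10.1 was installed under /usr/local/cuda-10.1 (adjust the path if you use the /usr/local/cuda symlink or a different version):
export PATH=/usr/local/cuda-10.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH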
Apply the changes:
$ source ~/.bashrc
Verify the installation
$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ sudo make        # the samples ship as source and must be built first
$ ./deviceQuery
△ The installation only succeeded if the output shows Result = PASS. Make absolutely sure this step passes without errors, otherwise failures during the later bazel build will be very hard to diagnose! (deviceQuery only runs successfully when the driver is installed correctly; otherwise it reports "no CUDA-capable device is detected" or "GPU is lost".)
If it fails, uninstall and reinstall:
$ sudo ./uninstall_cuda_9.2.pl   # the uninstall script is named after the installed CUDA version (here 9.2)
Install Python and the TensorFlow package dependencies
sudo apt install python-dev python-pip # or python3-dev python3-pip
Install the TensorFlow pip package dependencies (omit the --user flag if you are using a virtual environment):
pip install -U --user pip six numpy wheel setuptools mock 'future>=0.17.1'   # quote future>=0.17.1 so the shell does not treat > as a redirection
pip install -U --user keras_applications==1.0.6 --no-deps
pip install -U --user keras_preprocessing==1.0.5 --no-deps
cuDNN 7.6.3
On the cuDNN download page, select "cuDNN Library for Linux".
# Extract cuDNN in the directory where the archive was downloaded (not inside /usr/local/cuda)
$ tar -xvf cudnn-10.1-linux-x64-v7.6.3.30.tgz   # archive name for cuDNN 7.6.3 / CUDA 10.1; adjust to the file you downloaded
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
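One quick way to confirm the copy worked is to read the version macros from the installed header (in cuDNN 7.x they are defined in cudnn.h):
$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2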
Bazel
Building TensorFlow requires Bazel. Downloading it with wget is slow and needs a proxy; you can instead download the installer from GitHub and then copy it to the server.
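A sketch of installing Bazel from the GitHub release installer, assuming you fetched bazel-0.26.1-installer-linux-x86_64.sh from the Bazel releases page and copied it over:
$ chmod +x bazel-0.26.1-installer-linux-x86_64.sh
$ ./bazel-0.26.1-installer-linux-x86_64.sh --user   # --user installs into $HOME/bin; make sure that is on PATH
$ bazel version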
Later, when running ./configure for TensorFlow, it complained that the Bazel version was too high... (This happened with a copy of TensorFlow that was not downloaded from GitHub; the copy from a network drive was probably too old.)
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.29.1 installed.
Please downgrade your bazel installation to version 0.26.1 or lower to build TensorFlow! To downgrade: download the installer for the old version (from https://github.com/bazelbuild/bazel/releases) then run the installer.
Build and install TensorFlow from source
First attempt: TensorFlow 2.0
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
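If you want a specific release rather than master, check out its release branch before running ./configure; for example, the 1.14 series lives on the r1.14 branch:
$ git checkout r1.14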
(tf) apollo3d@apollo1:~/Downloads/tensorflow-master$ ./configure
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.26.0 installed.
Please specify the location of python. [Default is /home/apollo3d/tf/bin/python]: /usr/bin/python3
Found possible Python library paths:
/usr/local/lib/python3.5/dist-packages
/usr/lib/python3/dist-packages
Please input the desired Python library path to use. Default is [/usr/local/lib/python3.5/dist-packages]
Do you wish to build TensorFlow with XLA JIT support? [Y/n]:
XLA JIT support will be enabled for TensorFlow.
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.
Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.
Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.
Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.
Found CUDA 10.1 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Found cuDNN 7 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 3.5,7.0]: 5.0
Do you want to use clang as CUDA compiler? [y/N]: y
Clang will be used as CUDA compiler.
Do you wish to download a fresh release of clang? (Experimental) [y/N]: y
Clang will be downloaded and used to compile tensorflow.
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]: --config=v2
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=ngraph # Build with Intel nGraph support.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
--config=v2 # Build TensorFlow 2.x instead of 1.x.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished
Second attempt: TensorFlow 1.14
(pythonEnv) apollo3d@apollo1:~/Downloads/tensorflow$ ./configure
WARNING: Running Bazel server needs to be killed, because the startup options are different.
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.26.0 installed.
Please specify the location of python. [Default is /home/apollo3d/pythonEnv/bin/python]:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'site' has no attribute 'getsitepackages'
Found possible Python library paths:
/home/apollo3d/pythonEnv/lib/python3.5/site-packages
Please input the desired Python library path to use. Default is [/home/apollo3d/pythonEnv/lib/python3.5/site-packages]
Do you wish to build TensorFlow with XLA JIT support? [Y/n]:
XLA JIT support will be enabled for TensorFlow.
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.
Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.
Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.
Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.
Found CUDA 10.1 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Found cuDNN 7 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 5.2]: 5.0
Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=gdr # Build with GDR support.
--config=verbs # Build with libverbs support.
--config=ngraph # Build with Intel nGraph support.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=noignite # Disable Apache Ignite support.
--config=nokafka # Disable Apache Kafka support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished
...
INFO: Elapsed time: 4826.834s, Critical Path: 298.31s
INFO: 24978 processes: 24978 local.
INFO: Build completed successfully, 26636 total actions
The build took about an hour and a half.
△ During the build you may repeatedly hit errors like ERROR: [GET returned 404 Not Found, connect timed out]; just keep retrying.
Build the pip package
$ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
(In Bazel, -c is short for --compilation_mode and needs an argument such as -c opt; a bare -c is not valid, and --config=opt already covers the optimization settings.)
Starting local Bazel server and connecting to it...
WARNING: The following configs were expanded more than once: [cuda_clang, using_cuda, download_clang_use_lld]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
WARNING: option '--crosstool_top' was expanded to from both option '--config=cuda_clang' (source /home/apollo3d/Downloads/tensorflow-master/.tf_configure.bazelrc) and option '--config=download_clang' (source /home/apollo3d/Downloads/tensorflow-master/.tf_configure.bazelrc)
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow/bin   # pass a path where the Python wheel file should be written
Install the wheel with the virtual environment activated.
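A minimal sketch, assuming the virtualenv used during ./configure is at ~/pythonEnv (as in the transcripts below) and using the wheel name and output directory from the transcript further down; substitute whatever your build actually produces:
$ source ~/pythonEnv/bin/activate
(pythonEnv) $ pip install ~/tensorflow-2.0.0rc2-cp35-cp35m-linux_x86_64.whl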
Appendix:
When installing CUDA on Windows it kept refusing to install; the cause was that the driver that shipped with the system was too old (or not installed at all). Only after downloading and installing a driver from the NVIDIA driver page would CUDA install normally.
▲ Installing the right driver really matters!
Windows installation advice
After finishing the Linux setup I also installed everything on Windows 10 on my own laptop; the two processes are essentially the same, and Windows went quickly. Nearly all of the time I lost was due to installing the wrong driver version. Once the driver version is right, just install the NVIDIA driver, CUDA, and cuDNN in that order.
Pitfall notes
▲ Always download TensorFlow from the official GitHub repository, and take the latest source. Do not download it from a network drive; that cost me a long time, and things only started working after I cloned it from GitHub. (GitHub downloads can be slow; see my other article on how to speed them up.)
▲ The ./configure answers also matter: keep the defaults for everything except the CUDA options.
▲ Correspondence between CUDA, the GPU driver, and TensorFlow versions (this applies whenever you install the GPU version of TensorFlow on Linux)
The Linux x86_64 driver version and the CUDA Toolkit version must match; a CUDA release newer than what the driver supports cannot be used.
==> CUDA 10.1 needs a Linux x86_64 driver version >= 418.39 (410.48 is only enough for CUDA 10.0). Also, pip installs tensorflow-gpu 1.14 by default, which is built against cuDNN 7 and CUDA 10.0, which is why a CUDA 10.1 machine needs a build from source.
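Two quick ways to check what is actually installed (the driver version is printed in the nvidia-smi header, the toolkit version by nvcc):
$ nvidia-smi       # header shows "Driver Version: ..."
$ nvcc --version   # shows the CUDA Toolkit release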
▲ google.xxxx.xxxx ==> unknown hosts
Only after hitting this error did I realize the server's DNS was not configured (cloud servers usually don't have this problem).
$ sudo vim /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 8.8.8.8
nameserver 8.8.4.4
▲ An error occurred during the fetch of repository 'llvm': Error 404, can't connect
Just retry the build (several times if needed):
$ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
▲ Error: AttributeError: '_NamespacePath' object has no attribute 'sort'
Reinstall (upgrade) setuptools:
(pythonEnv) $ pip install -U setuptools   # -U upgrades; a plain install is a no-op if setuptools is already present
(pythonEnv) $ pip install googleapis-common-protos
Running the build command takes quite a long time; how long depends on your machine's performance.
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 77.892s, Critical Path: 77.18s
INFO: 45 processes: 45 local.
INFO: Build completed successfully, 46 total actions
Once Bazel completes the task above successfully, run the executable it produced, passing in a path where the Python wheel file should be stored:
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/
(pythonEnv) apollo3d@apollo1:~/Downloads/tensorflow$ bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/
2019年 09月 24日 星期二 14:34:05 CST : === Preparing sources in dir: /tmp/tmp.pguTqrHuLA
~/Downloads/tensorflow ~/Downloads/tensorflow
~/Downloads/tensorflow
/tmp/tmp.pguTqrHuLA/tensorflow/include ~/Downloads/tensorflow
~/Downloads/tensorflow
2019年 09月 24日 星期二 14:34:11 CST : === Building wheel
warning: no files found matching '*.pyd' under directory '*'
warning: no files found matching '*.pd' under directory '*'
warning: no files found matching '*.dylib' under directory '*'
warning: no files found matching '*.dll' under directory '*'
warning: no files found matching '*.lib' under directory '*'
warning: no files found matching '*.csv' under directory '*'
warning: no files found matching '*.h' under directory 'tensorflow_core/include/tensorflow'
warning: no files found matching '*' under directory 'tensorflow_core/include/third_party'
2019年 09月 24日 星期二 14:34:33 CST : === Output wheel file is in: /home/apollo3d/
Install with pip
The command above writes a Python .whl file to the output directory you passed (here ~/). Make sure your TensorFlow virtualenv is active, then install that wheel with pip (the exact file name of the wheel depends on the TensorFlow version, the operating system, and the Python version):
(pythonEnv) apollo3d@apollo1:~$ pip install tensorflow-2.0.0rc2-cp35-cp35m-linux_x86_64.whl
▲ After a successful install, a warning appears at runtime
Using TensorFlow prints FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy...
Cause: numpy 1.17.0 is too new; numpy 1.16.0 works fine.
Fix: reinstall numpy 1.16.0:
pip install numpy==1.16.0
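As a final sanity check, a minimal sketch (tf.test.is_gpu_available() exists in both 1.14 and 2.0, though newer releases recommend other APIs):
$ python -c "import tensorflow as tf; print(tf.__version__); print(tf.test.is_gpu_available())"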