Ubuntu 16.04 + CUDA Toolkit 10.1 + cuDNN 7.6 + Bazel 0.26.1, plus Windows notes

Ubuntu 16.04 + CUDA Toolkit 10.1 + cuDNN 7.6 + Bazel 0.26.1

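Just make sure the cuDNN version matches the CUDA SDK version.

When building from source you need to know your GPU's compute capability. You can look it up on NVIDIA's CUDA GPUs page (https://developer.nvidia.com/cuda-gpus) and enter it during configuration. It depends only on the GPU, not on the CUDA SDK version.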
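As a rough illustration of what the compute capability answer looks like, here is a minimal Python sketch that parses a comma-separated answer such as "3.5,7.0" and rejects entries below 3.5, the minimum quoted by the ./configure prompt later in this article. The function name `parse_compute_capabilities` is my own and is not part of TensorFlow.

```python
def parse_compute_capabilities(answer, minimum=(3, 5)):
    """Parse an answer such as "3.5,7.0" into a list of (major, minor) tuples.

    `minimum` defaults to 3.5, the lowest capability the ./configure prompt
    says TensorFlow supports. Raises ValueError for anything lower.
    """
    capabilities = []
    for item in answer.split(","):
        text = item.strip()
        major, minor = text.split(".")
        capability = (int(major), int(minor))
        if capability < minimum:
            raise ValueError(
                "compute capability %s is below %d.%d" % (text, minimum[0], minimum[1])
            )
        capabilities.append(capability)
    return capabilities
```

For example, `parse_compute_capabilities("5.0")` returns `[(5, 0)]`, which is the value entered in both configure runs shown below.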

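▲ You must register an account before you can select a download.

▲ Keep the points above in mind, and ideally read the pitfalls I recorded at the end of this article first, so you know what can go wrong.

▲ Keep access to the external internet (through a proxy) for the whole process.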

Installing the NVIDIA driver

 $ ubuntu-drivers devices # list the detected GPU and the recommended drivers

Check the NVIDIA driver version

$ dpkg --list | grep nvidia

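Driver and CUDA version correspondence

For the latest table, check the official site.

Download the NVIDIA driver

▲ Note: the driver and CUDA versions must match!!!

After installation, you can check GPU usage with nvidia-smi.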
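If you want to read the driver version from a script rather than from the nvidia-smi table, something like the following sketch could work. It assumes nvidia-smi is on the PATH and uses its `--query-gpu=driver_version --format=csv,noheader` options; the helper names are hypothetical.

```python
import shutil
import subprocess


def parse_driver_versions(text):
    """Turn nvidia-smi's one-version-per-line output into a list of strings."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def query_driver_versions():
    """Return one driver version string per GPU, or None if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return None
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        universal_newlines=True,
    )
    return parse_driver_versions(output)
```

Calling `query_driver_versions()` on a machine with a working driver should give a list with one entry per GPU; a `None` result means the tool itself was not found.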

CUDA toolkit 10.1

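Check the GPU driver version before installing.

 # Installation
$ sudo apt update
$ sudo apt install cuda

Set the environment variables

 $ vim ~/.bashrc

Add the CUDA directories to these two variables:

PATH
LD_LIBRARY_PATH

Apply the changes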

$ source ~/.bashrc
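To confirm the variables took effect, a small check like the one below may help. It is only a sketch: it assumes the toolkit lives under /usr/local/cuda with its binaries in bin and its libraries in lib64 (the lib64 path matches the configure output later in this article; the bin path is my assumption), and `missing_cuda_entries` is a hypothetical helper.

```python
import os


def missing_cuda_entries(env, cuda_home="/usr/local/cuda"):
    """Return {variable: directory} for each CUDA directory absent from env.

    `env` is a mapping such as os.environ. The expected directories are
    assumptions based on a default /usr/local/cuda layout.
    """
    expected = {
        "PATH": cuda_home + "/bin",
        "LD_LIBRARY_PATH": cuda_home + "/lib64",
    }
    missing = {}
    for variable, directory in expected.items():
        # Entries are colon-separated on Linux; ignore trailing slashes.
        entries = [entry.rstrip("/") for entry in env.get(variable, "").split(":")]
        if directory not in entries:
            missing[variable] = directory
    return missing
```

Running `missing_cuda_entries(os.environ)` in a fresh shell should return an empty dict once ~/.bashrc is set up as intended.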

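Verify the installation

  cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make # build the sample before running it
./deviceQuery

△ The installation succeeded only if the output shows Result = PASS. Make sure this step passes; otherwise errors during the later bazel build become very hard to diagnose!!! (It only runs successfully when the driver is installed correctly; otherwise it reports "no CUDA-capable device is detected" or "GPU is lost".)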
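If you script the installation, the PASS check can be automated. The sketch below scans captured deviceQuery output for the result line; `device_query_passed` is a hypothetical name, and the pattern accepts both "Result = PASS" and the "Result:PASS" spelling.

```python
import re


def device_query_passed(output):
    """Return True if the captured deviceQuery output has a passing result line."""
    pattern = r"^\s*Result\s*[=:]\s*PASS\s*$"
    return re.search(pattern, output, re.MULTILINE) is not None
```

You would feed it the text captured from ./deviceQuery, for example via `subprocess.check_output`.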

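If it fails, uninstall and reinstall

  sudo ./uninstall_cuda_9.2.pl # the script name depends on the installed CUDA version (this one is from CUDA 9.2)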

Install the Python and TensorFlow package dependencies

 sudo apt install python-dev python-pip # or python3-dev python3-pip

Install the TensorFlow pip package dependencies (omit the --user flag if you are using a virtual environment):

 pip install -U --user pip six numpy wheel setuptools mock 'future>=0.17.1'
pip install -U --user keras_applications==1.0.6 --no-deps
pip install -U --user keras_preprocessing==1.0.5 --no-deps

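cuDNN 7.6.3

Choose "cuDNN Library for Linux"

 # Extract cuDNN (substitute the archive you actually downloaded; the file name below is from an older cuDNN 5.1 / CUDA 8.0 release)
cd /usr/local/cuda/
$ tar -xvf cudnn-8.0-linux-x64-v5.1.tgz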

sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
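To double-check which cuDNN version ended up in /usr/local/cuda/include, you can read the version macros from cudnn.h. This sketch assumes the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, as the cuDNN 7.x headers do; `cudnn_version` is a hypothetical helper.

```python
import re


def cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h, or None if absent."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#define\s+%s\s+(\d+)" % name, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)
```

A typical call would be `cudnn_version(open("/usr/local/cuda/include/cudnn.h").read())`, with the path taken from the copy commands above.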

Bazel

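Building TensorFlow requires bazel. Downloading it with wget is slow and needs a proxy, so you can try downloading it from GitHub on another machine and then uploading it to the server.

Later, when I ran ./configure for TensorFlow, it complained that the Bazel version was too high... (This happened because my TensorFlow source was not downloaded from GitHub; the copy I got from a network drive was probably too old.)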

 WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.29.1 installed.
Please downgrade your bazel installation to version 0.26.1 or lower to build TensorFlow! To downgrade: download the installer for the old version (from https://github.com/bazelbuild/bazel/releases) then run the installer.
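The check behind that message amounts to comparing dotted version numbers component by component. A minimal sketch, assuming plain numeric versions like the ones in the log and using hypothetical helper names:

```python
def version_tuple(version):
    """Convert a dotted version such as "0.26.1" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def bazel_version_ok(installed, maximum="0.26.1"):
    """Return True if `installed` is at or below the maximum accepted version."""
    return version_tuple(installed) <= version_tuple(maximum)
```

With the values from the log, `bazel_version_ok("0.29.1")` is False and `bazel_version_ok("0.26.0")` is True. Comparing tuples of ints avoids the trap of string comparison, where "0.9.0" would sort after "0.26.1".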

Building and installing TensorFlow from source

The first attempt was with version 2.0

 git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
(tf) apollo3d@apollo1:~/Downloads/tensorflow-master$ ./configure
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.26.0 installed.
Please specify the location of python. [Default is /home/apollo3d/tf/bin/python]: /usr/bin/python3


Found possible Python library paths:
/usr/local/lib/python3.5/dist-packages
/usr/lib/python3/dist-packages
Please input the desired Python library path to use. Default is [/usr/local/lib/python3.5/dist-packages]

Do you wish to build TensorFlow with XLA JIT support? [Y/n]:
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.

Found CUDA 10.1 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Found cuDNN 7 in:
/usr/local/cuda/lib64
/usr/local/cuda/include


Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 3.5,7.0]: 5.0


Do you want to use clang as CUDA compiler? [y/N]: N
Clang will be used as CUDA compiler.

Do you wish to download a fresh release of clang? (Experimental) [y/N]: N
Clang will be downloaded and used to compile tensorflow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]: --config=v2


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=ngraph # Build with Intel nGraph support.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
--config=v2 # Build TensorFlow 2.x instead of 1.x.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished

The second attempt was with 1.14

 (pythonEnv) apollo3d@apollo1:~/Downloads/tensorflow$ ./configure
WARNING: Running Bazel server needs to be killed, because the startup options are different.
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.26.0 installed.
Please specify the location of python. [Default is /home/apollo3d/pythonEnv/bin/python]:


Traceback (most recent call last):
File "<string>", line 1, in <module>
AttributeError: module 'site' has no attribute 'getsitepackages'
Found possible Python library paths:
/home/apollo3d/pythonEnv/lib/python3.5/site-packages
Please input the desired Python library path to use. Default is [/home/apollo3d/pythonEnv/lib/python3.5/site-packages]

Do you wish to build TensorFlow with XLA JIT support? [Y/n]:
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.

Found CUDA 10.1 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Found cuDNN 7 in:
/usr/local/cuda/lib64
/usr/local/cuda/include


Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 5.2]: 5.0


Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:


Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=gdr # Build with GDR support.
--config=verbs # Build with libverbs support.
--config=ngraph # Build with Intel nGraph support.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=noignite # Disable Apache Ignite support.
--config=nokafka # Disable Apache Kafka support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished

...

INFO: Elapsed time: 4826.834s, Critical Path: 298.31s
INFO: 24978 processes: 24978 local.
INFO: Build completed successfully, 26636 total actions

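The build took about 80 minutes (the 4826 s reported above).

△ During the build, ERROR: [GET returned 404 Not Found, connect timed out] may appear several times; just rerun the build.

Build the pip package

  bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

Note: in bazel, -c is short for --compilation_mode and expects a value (fastbuild, dbg, or opt). It is not the compiler's "compile without linking" flag, so a bare -c is left out of the command here.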

Starting local Bazel server and connecting to it...
WARNING: The following configs were expanded more than once: [cuda_clang, using_cuda, download_clang_use_lld]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
WARNING: option '--crosstool_top' was expanded to from both option '--config=cuda_clang' (source /home/apollo3d/Downloads/tensorflow-master/.tf_configure.bazelrc) and option '--config=download_clang' (source /home/apollo3d/Downloads/tensorflow-master/.tf_configure.bazelrc)

$ bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow/bin # the argument is the directory where the Python wheel file will be written

Install the wheel with the virtual environment activated.

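Appendix:

On Windows, the CUDA installer kept reporting that it could not install. The cause was that the graphics driver installed by default was too old, or that no driver was installed at all. CUDA installed normally only after I downloaded and installed the driver from the graphics driver download page.

▲ Installing the driver correctly really matters!!!

Windows installation advice

After finishing on Linux, I also did the installation on Windows 10 on my own laptop. The two procedures are essentially the same, and the Windows one went quickly. Almost all the time I lost went to installing the wrong driver version. Once the driver version is right, install the NVIDIA driver, CUDA, and cuDNN in that order.

Pitfalls

▲ Download the latest tensorflow from the official source. Never download it from a network drive; that trapped me for a long time. Things only gradually started working after I downloaded it from GitHub. (GitHub downloads can be slow; see my other article on how to speed them up.)

The ./configure answers also matter: apart from the CUDA option, accept the default for everything.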

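Installing the GPU version of TensorFlow on Linux

Correspondence between CUDA, GPU driver, and TensorFlow versions

The Linux x86_64 Driver Version must match the CUDA Toolkit. If the CUDA version is newer than what the installed driver supports, CUDA cannot be used.

==> CUDA 10.1 requires Linux x86_64 Driver Version >= 418.39 (410.48 is the minimum for CUDA 10.0). Also, the tensorflow-gpu package that pip installed by default at the time was 1.14, whose prebuilt binaries expect cuDNN 7 and CUDA 10.0.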
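The driver requirement can be expressed as a small lookup. The sketch below is illustrative only: the minimum versions are the Linux x86_64 values as I recall them from the table in NVIDIA's CUDA Toolkit release notes, so re-check them against the official table before relying on them. `MIN_LINUX_DRIVER` and `driver_supports_cuda` are hypothetical names.

```python
# Minimum Linux x86_64 driver per CUDA release (assumed values; verify
# against the table in NVIDIA's CUDA Toolkit release notes).
MIN_LINUX_DRIVER = {
    "10.1": (418, 39),
    "10.0": (410, 48),
    "9.2": (396, 26),
    "9.0": (384, 81),
}


def driver_supports_cuda(driver_version, cuda_version):
    """Return True if a driver such as "418.87.00" is new enough for cuda_version."""
    required = MIN_LINUX_DRIVER[cuda_version]
    # Compare only the first two numeric components, e.g. (418, 87).
    installed = tuple(int(part) for part in driver_version.split(".")[:2])
    return installed >= required
```

For instance, `driver_supports_cuda("410.48", "10.1")` is False under these values, which is the "CUDA newer than the driver" situation described above.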

google.xxxx.xxxx ==> unknown hosts

Only after hitting this error did I realize the server's DNS was not configured (cloud servers usually do not have this problem).

 $ sudo vim /etc/resolv.conf

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 8.8.8.8
nameserver 8.8.4.4
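A quick way to see whether any resolver is configured at all is to list the nameserver entries in the file. This is a minimal sketch with a hypothetical helper name; it only parses text and does not test whether the servers actually answer.

```python
def configured_nameservers(resolv_conf_text):
    """Return the addresses from "nameserver" lines, ignoring comments."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers
```

Calling it on the contents of /etc/resolv.conf and getting an empty list would match the "unknown hosts" symptom described above.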

▲ An error occurred during the fetch of repository 'llvm': Error 404, can't connect

Just retry a few times

  bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
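Since the advice is simply to rerun the build until the downloads succeed, the retrying can be wrapped in a small loop. This is a sketch under my own assumptions: `run_with_retries` and `bazel_build` are hypothetical helpers, the attempt count and delay are arbitrary, and the bazel arguments are the build command from this article without the stray -c.

```python
import subprocess
import time


def run_with_retries(run_once, attempts=5, delay_seconds=10):
    """Call run_once() until it returns 0 or the attempts are used up.

    Returns the exit code of the last call.
    """
    code = None
    for attempt in range(1, attempts + 1):
        code = run_once()
        if code == 0:
            break
        if attempt < attempts:
            time.sleep(delay_seconds)  # brief pause before the next try
    return code


def bazel_build():
    """Run the pip-package build once and return bazel's exit code."""
    return subprocess.call([
        "bazel", "build", "--config=opt", "--config=cuda",
        "//tensorflow/tools/pip_package:build_pip_package",
    ])
```

From the TensorFlow source directory you would call `run_with_retries(bazel_build)`. Note that this retries on any failure, including genuine compile errors, so keep the attempt count small.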

▲ Error: AttributeError: '_NamespacePath' object has no attribute 'sort'

Reinstall setuptools

 (pythonEnv) $ pip install setuptools
(pythonEnv) $ pip install googleapis-common-protos

The build command above takes quite a long time; how long depends on your machine's performance.

 Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 77.892s, Critical Path: 77.18s
INFO: 45 processes: 45 local.
INFO: Build completed successfully, 46 total actions

Once Bazel has finished successfully, run the executable it produced and pass it the directory where the Python wheel file should be stored:

bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/

 (pythonEnv) apollo3d@apollo1:~/Downloads/tensorflow$ bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/
2019年 09月 24日 星期二 14:34:05 CST : === Preparing sources in dir: /tmp/tmp.pguTqrHuLA
~/Downloads/tensorflow ~/Downloads/tensorflow
~/Downloads/tensorflow
/tmp/tmp.pguTqrHuLA/tensorflow/include ~/Downloads/tensorflow
~/Downloads/tensorflow
2019年 09月 24日 星期二 14:34:11 CST : === Building wheel
warning: no files found matching '*.pyd' under directory '*'
warning: no files found matching '*.pd' under directory '*'
warning: no files found matching '*.dylib' under directory '*'
warning: no files found matching '*.dll' under directory '*'
warning: no files found matching '*.lib' under directory '*'
warning: no files found matching '*.csv' under directory '*'
warning: no files found matching '*.h' under directory 'tensorflow_core/include/tensorflow'
warning: no files found matching '*' under directory 'tensorflow_core/include/third_party'
2019年 09月 24日 星期二 14:34:33 CST : === Output wheel file is in: /home/apollo3d/

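Install with pip

The command above creates a Python .whl file in the output directory you passed to it. Make sure your TensorFlow virtualenv is active, then install the wheel file with pip (note that the exact file name depends on the TensorFlow version you built, your operating system, and your Python version):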

(pythonEnv) apollo3d@apollo1:~$ pip install tensorflow-2.0.0rc2-cp35-cp35m-linux_x86_64.whl
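To see how the file name encodes those three things, here is a minimal sketch that splits a wheel name into its fields following the standard wheel naming scheme (name, version, optional build tag, Python tag, ABI tag, platform tag). `parse_wheel_name` is a hypothetical helper, not a pip API.

```python
def parse_wheel_name(filename):
    """Split a wheel file name into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: %s" % filename)
    parts = filename[: -len(".whl")].split("-")
    # 5 fields normally, 6 when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel name: %s" % filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }
```

For the wheel installed above, the fields come out as TensorFlow 2.0.0rc2, CPython 3.5 (cp35 / cp35m), and linux_x86_64, which is why a wheel built on one machine may not install on another with a different Python version.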

▲ After a successful install, a warning appears at run time

Using TensorFlow prints FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy.......

Cause: numpy 1.17.0 is too new for this TensorFlow build; numpy 1.16.0 works.

Fix: reinstall numpy 1.16.0

  pip install numpy==1.16.0
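If you want to detect the situation before importing TensorFlow, a version check along these lines could be used. It is a sketch based only on the diagnosis above (numpy 1.17 and newer trigger the warning with this build); `numpy_triggers_warning` is a hypothetical helper.

```python
def numpy_triggers_warning(numpy_version):
    """Return True if the given numpy version is 1.17 or newer."""
    major, minor = (int(part) for part in numpy_version.split(".")[:2])
    return (major, minor) >= (1, 17)
```

In practice you would pass it `numpy.__version__`; a True result suggests downgrading as shown above.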
