Installing TensorFlow-GPU on Ubuntu 16.04 with an NVIDIA GTX 1080 Ti

Environment

  • OS: Ubuntu 16.04.5 desktop
  • GPU: NVIDIA GeForce GTX 1080 Ti
  • CUDA: 9.0
  • cuDNN: 7.3
  • tensorflow-gpu: 1.10
    Official documentation: https://www.tensorflow.org/install/install_linux

1 Install Ubuntu 16.04.5: basic configuration and remote desktop

See:
Ubuntu 16.04.5 desktop basic configuration and remote desktop

2 Install the NVIDIA driver (important)

2.1 Method 1: install from the desktop

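The GPU build of TensorFlow requires an NVIDIA GPU. If you update the NVIDIA driver without first disabling the graphics driver that ships with Ubuntu, you may see errors such as "X server is running", be prompted to reboot over and over, or end up with a driver that installs successfully but cannot be connected to.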

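On the desktop edition of Ubuntu there is a much simpler way. In "Software & Updates", the "Additional Drivers" tab automatically detects the official NVIDIA driver; select it, install it, and reboot.
After installation, check the driver information: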

user@gpu:~$ nvidia-smi 
Sat Sep 22 17:50:29 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0  On |                  N/A |
|  0%   44C    P8    14W / 300W |    249MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1184      G   /usr/lib/xorg/Xorg                           126MiB |
|    0      1773      G   compiz                                       120MiB |
+-----------------------------------------------------------------------------+
user@gpu:~$ 

The driver version must be >= 384.81.
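
The version requirement above can be checked from the nvidia-smi banner. The following is a minimal sketch, not part of the original procedure: the helper names are hypothetical, and it assumes the banner contains a "Driver Version: X.Y" field as in the output shown earlier. Versions are compared component by component, so 384.130 counts as newer than 384.81.

```python
import re

# Minimum driver version stated in the text above.
MIN_DRIVER = (384, 81)


def parse_driver_version(smi_output):
    """Return the driver version found in nvidia-smi text as a tuple of ints, or None."""
    match = re.search(r"Driver Version:\s*([0-9]+(?:\.[0-9]+)*)", smi_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def driver_is_new_enough(smi_output, minimum=MIN_DRIVER):
    """Compare versions component-wise, not as decimal numbers."""
    version = parse_driver_version(smi_output)
    return version is not None and version >= minimum


# Banner line copied from the nvidia-smi output above.
sample = "| NVIDIA-SMI 384.130                Driver Version: 384.130                   |"
print(parse_driver_version(sample))   # (384, 130)
print(driver_is_new_enough(sample))   # True
```

On a real machine the text would come from running nvidia-smi, for example through subprocess.check_output.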

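2.2 Method 2: install on the server edition

2.2.1 Download the driver

Download: official NVIDIA site
Select your product model, operating system, and language.
My selection:

Field             Value
Product Type      GeForce
Product Series    GeForce 10 Series
Product Family    GeForce GTX 1080 Ti
Operating System  Linux 64-bit
Language          English

The file I downloaded: NVIDIA-Linux-x86_64-390.87.run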

2.2.2 Install

user@gpu:~$ mkdir ~/driver
user@gpu:~$ cd ~/driver
user@gpu:driver$ sudo chmod +x NVIDIA-Linux-x86_64-390.87.run
user@gpu:driver$ sudo sh NVIDIA-Linux-x86_64-390.87.run

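The first step of the installer shows the license agreement; choose accept. Then follow the prompts. Partway through, a warning says that the 32-bit files cannot be installed; ignore it and continue. After that, follow the remaining prompts step by step.

If the NVIDIA driver installer reports the following error: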

ERROR: You appear to be running an X server; please exit X before 
installing. For further details, please see the section INSTALLING 
THE NVIDIA DRIVER in the README available on the Linux driver 
download page at www.nvidia.com.

Stopping the display manager is usually enough to stop X:

sudo systemctl stop lightdm.service

A more general way:

sudo systemctl stop display-manager

After the installation completes, reboot:

user@gpu:driver$ sudo reboot

After installation, check the driver information:

user@gpu:driver$ nvidia-smi 

To uninstall:

user@gpu:driver$ sudo sh NVIDIA-Linux-x86_64-390.87.run --uninstall

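2.3 Method 3: disable the graphics driver that ships with Ubuntu

Remove the Nouveau kernel driver (fixes the NVIDIA installation error)
Reference: https://tutorials.technology/tutorials/85-How-to-remove-Nouveau-kernel-driver-Nvidia-install-error.html

Warning: this procedure may break your system. Make sure you back up the system before following these steps.

If the Nouveau kernel driver is currently in use, installing the official NVIDIA driver fails with the error below. The following steps explain how to fix the error and install the official driver.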

ERROR: The Nouveau kernel driver is currently in use by your system.  
This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding.  
Please consult the NVIDIA driver README and
your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.

2.3.1 Remove all nvidia packages

In this step we remove every nvidia-related package.

user@gpu:~$ sudo apt-get remove nvidia* && sudo apt autoremove

If you get the following error, no nvidia package was ever installed, which is fine:

no matches found: nvidia*

Now install some required dependencies:

user@gpu:~$ sudo apt-get install dkms build-essential linux-headers-generic

2.3.2 Blacklist the nouveau driver

Now block and disable the nouveau kernel driver:

user@gpu:~$ sudo vim /etc/modprobe.d/blacklist.conf
# append the following lines

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
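
As a quick sanity check of the entries above, the text of a modprobe configuration file can be scanned for its blacklist lines. This is only an illustrative sketch: the function name is hypothetical, and it simply looks for lines of the form "blacklist <module>" after stripping comments.

```python
def blacklisted_modules(conf_text):
    """Collect module names from 'blacklist <module>' lines, ignoring comments."""
    names = set()
    for line in conf_text.splitlines():
        # Drop anything after '#' so commented-out entries are not counted.
        parts = line.split("#", 1)[0].split()
        if len(parts) == 2 and parts[0] == "blacklist":
            names.add(parts[1])
    return names


# The same lines that were appended to /etc/modprobe.d/blacklist.conf above.
example = """
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
"""
print(sorted(blacklisted_modules(example)))   # ['lbm-nouveau', 'nouveau']
```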

2.3.3 Update the initramfs

Run the following command to disable kernel mode setting for nouveau:

user@gpu:~$ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf

Finally, rebuild the initramfs and reboot:

user@gpu:~$ sudo update-initramfs -u
user@gpu:~$ reboot
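
After the reboot, one way to confirm that nouveau is no longer loaded is to look for it in /proc/modules, where the first field of each line is a loaded module's name. The sketch below is an assumption-laden illustration rather than part of the original tutorial; both function names are hypothetical.

```python
def module_loaded(name, proc_modules_text):
    """Return True if `name` is the first field of any line in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's module list; return an empty string if it is unavailable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


print("nouveau loaded:", module_loaded("nouveau", read_proc_modules()))
```

If this still reports that nouveau is loaded, the blacklist or the initramfs update probably did not take effect.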

3 Install NVIDIA CUDA 9.0

3.1 Change Ubuntu's default runlevel to 3

To avoid the following error when installing CUDA, change Ubuntu's default runlevel to 3.

Installing the NVIDIA display driver...
It appears that an X server is running. Please exit X before installation. 
If you're sure that X is not running, but are getting this error, please delete any X lock files in /tmp.

3.1.1 Check the current runlevel

user@gpu:~$ runlevel
N 5

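3.1.2 Change the runlevel to 3

Switching between command-line mode and the graphical interface

Command line --> graphical interface:

To enter the graphical interface once (the system still boots to the command line after a restart), run: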

user@gpu:~$ sudo systemctl start lightdm

To boot to the graphical interface by default, run:

user@gpu:~$ sudo systemctl set-default graphical.target

Then reboot the system:

user@gpu:~$ sudo reboot

Graphical interface --> command line:

To boot to the command line by default, run:

user@gpu:~$ sudo systemctl set-default multi-user.target

Then reboot the system:

user@gpu:~$ sudo reboot

3.1.3 Verify

user@gpu:~$ runlevel
N 3
user@gpu:~$
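
The runlevel numbers printed above correspond to the systemd targets used in the set-default commands: runlevel 3 is multi-user.target and runlevel 5 is graphical.target. The following sketch shows that mapping; the helper names are hypothetical, and it assumes the two-field "previous current" output format shown above.

```python
# systemd equivalents of the two runlevels used in this section.
RUNLEVEL_TARGETS = {"3": "multi-user.target", "5": "graphical.target"}


def current_runlevel(runlevel_output):
    """`runlevel` prints '<previous> <current>', e.g. 'N 3'; return the current one."""
    fields = runlevel_output.split()
    return fields[1] if len(fields) == 2 else None


def target_for(runlevel_output):
    """Map the current runlevel to its systemd target, or None if unknown."""
    return RUNLEVEL_TARGETS.get(current_runlevel(runlevel_output))


print(target_for("N 5"))   # graphical.target
print(target_for("N 3"))   # multi-user.target
```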

3.2 Download NVIDIA CUDA 9.0

3.2.1 Download links

Latest release:
https://developer.nvidia.com/cuda-downloads

Archived releases:
https://developer.nvidia.com/cuda-toolkit-archive

user@gpu:/data/tools$ ll
total 1952872
drwxr-xr-x 3 user user        269 Sep 14 13:25 ./
drwxr-xr-x 3 user user         19 Sep 14 10:21 ../
-rw-rw-r-- 1 user user 1643293725 Sep 22 16:35 cuda_9.0.176_384.81_linux.run

The NVIDIA driver installed above is version 384.130; the driver bundled with this package is version 384.81.

3.2.2 Install the dependencies libGLU.so, libX11.so, libXi.so, and libXmu.so

user@gpu:~$ sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev

Otherwise the CUDA installer reports the following:

Installing the NVIDIA display driver...
Installing the CUDA Toolkit in /usr/local/cuda-9.0 ...
Missing recommended library: libGLU.so
Missing recommended library: libX11.so
Missing recommended library: libXi.so
Missing recommended library: libXmu.so
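
Before running the installer, the four libraries named in the warning can be looked up with Python's ctypes.util.find_library. Treat this as a rough approximation only: the function name is hypothetical, and find_library locates the runtime libraries, which may not be exactly what the CUDA installer tests for.

```python
from ctypes.util import find_library

# Library names taken from the installer warning, without the "lib" prefix and ".so" suffix.
RECOMMENDED = ("GLU", "X11", "Xi", "Xmu")


def missing_libraries(names=RECOMMENDED, finder=find_library):
    """Return the names for which the system linker lookup finds nothing."""
    return [name for name in names if finder(name) is None]


print("Missing:", missing_libraries() or "none")
```

The finder parameter exists only so that the lookup can be replaced in tests.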

3.3 Install NVIDIA CUDA 9.0

user@gpu:/data/tools$ sudo sh cuda_9.0.176_384.81_linux.run
......
# press the space bar to page through the license
......
Do you accept the previously read EULA?
accept/decline/quit: accept             # accept the license

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81?
(y)es/(n)o/(q)uit: y                    # install the NVIDIA accelerated graphics driver

Do you want to install the OpenGL libraries?
(y)es/(n)o/(q)uit [ default is yes ]: n # do not install the OpenGL libraries

Do you want to run nvidia-xconfig?
This will update the system X configuration file so that the NVIDIA X driver
is used. The pre-existing X configuration file will be backed up.
This option should not be used on systems that require a custom
X configuration, such as systems with multiple GPU vendors.
(y)es/(n)o/(q)uit [ default is no ]:    # keep the default: do not run nvidia-xconfig

Install the CUDA 9.0 Toolkit?
(y)es/(n)o/(q)uit: y                    # install the CUDA 9.0 Toolkit

Enter Toolkit Location
 [ default is /usr/local/cuda-9.0 ]:    # CUDA install location

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y                    # install the symbolic link

Install the CUDA 9.0 Samples?
(y)es/(n)o/(q)uit: y                    # install the CUDA samples

Enter CUDA Samples Location
 [ default is /home/user ]:             # CUDA samples location

Installing the NVIDIA display driver...
Installing the CUDA Toolkit in /usr/local/cuda-9.0 ...
Installing the CUDA Samples in /home/user ...
Copying samples to /home/user/NVIDIA_CUDA-9.0_Samples now...
Finished copying samples.

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-9.0
Samples:  Installed in /home/user

Please make sure that                   # reminder to set these environment variables
 -   PATH includes /usr/local/cuda-9.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-9.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall    # how to uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.0/doc/pdf for detailed information on setting up CUDA.

Logfile is /tmp/cuda_install_14141.log
user@gpu:/data/tools/tensorflow-gpu$ 

3.4 Add environment variables

user@gpu:~$ vim ~/.bashrc           # append at the end
# cuda
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
user@gpu:~$ source ~/.bashrc
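
The ${VAR:+:${VAR}} form in the two export lines appends ":" plus the old value only when the variable is already set and non-empty, so no stray colon (an empty entry, which is treated as the current directory) is left behind. The sketch below mimics that rule in Python purely to illustrate it; the function name is hypothetical.

```python
def prepend(directory, existing):
    """Mimic DIR${VAR:+:${VAR}}: add ':' + VAR only when VAR is set and non-empty."""
    return directory + (":" + existing if existing else "")


# Variable already has a value: the new directory goes in front.
print(prepend("/usr/local/cuda-9.0/bin", "/usr/bin:/bin"))   # /usr/local/cuda-9.0/bin:/usr/bin:/bin
# Variable empty or unset: no trailing colon is produced.
print(prepend("/usr/local/cuda-9.0/lib64", ""))              # /usr/local/cuda-9.0/lib64
print(prepend("/usr/local/cuda-9.0/lib64", None))            # /usr/local/cuda-9.0/lib64
```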

3.5 Verify

user@gpu:/data/tools/tensorflow-gpu$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
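
To check the toolkit version in a script, the release number can be pulled out of the nvcc -V text. This is a minimal sketch with a hypothetical function name; it assumes the "release X.Y" wording shown in the output above.

```python
import re


def cuda_release(nvcc_output):
    """Return (major, minor) from the 'release X.Y' field of nvcc -V output, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Last line of the nvcc -V output shown above.
sample = "Cuda compilation tools, release 9.0, V9.0.176"
print(cuda_release(sample))   # (9, 0)
```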

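4 Install NVIDIA cuDNN 7.3.0.29-1

GPU-accelerated deep learning
Before installing cuDNN, make sure CUDA and the NVIDIA driver are installed correctly.

4.1 Download the deb packages

An NVIDIA account is required (register and log in).
https://developer.nvidia.com/cudnn
Choose the cuDNN version that matches your operating system and CUDA version (three packages for Ubuntu).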

user@gpu:/data/tools$ ll
total 1952872
drwxr-xr-x 3 user user        269 Sep 14 13:25 ./
drwxr-xr-x 3 user user         19 Sep 14 10:21 ../
-rw-rw-r-- 1 user user 1643293725 Sep 22 16:35 cuda_9.0.176_384.81_linux.run
-rw-rw-r-- 1 user user  125687148 Sep 22 16:33 libcudnn7_7.3.0.29-1+cuda9.0_amd64.deb
-rw-rw-r-- 1 user user  115870862 Sep 22 16:33 libcudnn7-dev_7.3.0.29-1+cuda9.0_amd64.deb
-rw-rw-r-- 1 user user    4913038 Sep 22 16:33 libcudnn7-doc_7.3.0.29-1+cuda9.0_amd64.deb

4.2 Install cuDNN

user@gpu:/data/tools$ sudo dpkg -i libcudnn7_7.3.0.29-1+cuda9.0_amd64.deb
Selecting previously unselected package libcudnn7.
(Reading database ... 265027 files and directories currently installed.)
Preparing to unpack libcudnn7_7.3.0.29-1+cuda9.0_amd64.deb ...
Unpacking libcudnn7 (7.3.0.29-1+cuda9.0) ...
Setting up libcudnn7 (7.3.0.29-1+cuda9.0) ...
Processing triggers for libc-bin (2.23-0ubuntu10) ...

user@gpu:/data/tools$ sudo dpkg -i libcudnn7-dev_7.3.0.29-1+cuda9.0_amd64.deb
Selecting previously unselected package libcudnn7-dev.
(Reading database ... 265033 files and directories currently installed.)
Preparing to unpack libcudnn7-dev_7.3.0.29-1+cuda9.0_amd64.deb ...
Unpacking libcudnn7-dev (7.3.0.29-1+cuda9.0) ...
Setting up libcudnn7-dev (7.3.0.29-1+cuda9.0) ...
update-alternatives: using /usr/include/x86_64-linux-gnu/cudnn_v7.h to provide /usr/include/cudnn.h (libcudnn) in auto mode

user@gpu:/data/tools$ sudo dpkg -i libcudnn7-doc_7.3.0.29-1+cuda9.0_amd64.deb
Selecting previously unselected package libcudnn7-doc.
(Reading database ... 265039 files and directories currently installed.)
Preparing to unpack libcudnn7-doc_7.3.0.29-1+cuda9.0_amd64.deb ...
Unpacking libcudnn7-doc (7.3.0.29-1+cuda9.0) ...
Setting up libcudnn7-doc (7.3.0.29-1+cuda9.0) ...

4.3 Verify that cuDNN is installed correctly

user@gpu:/data/tools$ cp -r /usr/src/cudnn_samples_v7 $HOME
user@gpu:/data/tools$ cd $HOME/cudnn_samples_v7/mnistCUDNN
user@gpu:~/cudnn_samples_v7/mnistCUDNN$ make clean && make
rm -rf *o
rm -rf mnistCUDNN
/usr/local/cuda/bin/nvcc -ccbin g++ -I/usr/local/cuda/include -IFreeImage/include  -m64    -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53 -o fp16_dev.o -c fp16_dev.cu
g++ -I/usr/local/cuda/include -IFreeImage/include   -o fp16_emu.o -c fp16_emu.cpp
g++ -I/usr/local/cuda/include -IFreeImage/include   -o mnistCUDNN.o -c mnistCUDNN.cpp
/usr/local/cuda/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53 -o mnistCUDNN fp16_dev.o fp16_emu.o mnistCUDNN.o -I/usr/local/cuda/include -IFreeImage/include  -LFreeImage/lib/linux/x86_64 -LFreeImage/lib/linux -lcudart -lcublas -lcudnn -lfreeimage -lstdc++ -lm

# Run ./mnistCUDNN
user@gpu:~/cudnn_samples_v7/mnistCUDNN$ ./mnistCUDNN 
cudnnGetVersion() : 7300 , CUDNN_VERSION from cudnn.h : 7300 (7.3.0)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 28  Capabilities 6.1, SmClock 1683.0 Mhz, MemSize (Mb) 11170, MemClock 5505.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.032384 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.044032 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.053248 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.076640 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.090112 time requiring 207360 memory
Resulting weights from Softmax:
0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000 
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000 
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006 

Result of classification: 1 3 5

Test passed!

Testing half precision (math in single precision)
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.031712 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.033664 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.074752 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.079872 time requiring 207360 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.084992 time requiring 2057744 memory
Resulting weights from Softmax:
0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001 
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000 
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006 

Result of classification: 1 3 5

Test passed!

If the installation succeeded, the message "Test passed!" is printed.
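
The "Result of classification: 1 3 5" line follows from the softmax rows: each row holds ten weights, one per digit 0 to 9, and the predicted digit is the index of the largest weight. The sketch below recomputes that from the three half-precision rows printed above; the function name is hypothetical and this is not how the sample program itself is written.

```python
def predicted_digit(softmax_line):
    """Return the index of the largest weight in a whitespace-separated softmax row."""
    weights = [float(token) for token in softmax_line.split()]
    return max(range(len(weights)), key=weights.__getitem__)


# The three half-precision rows copied from the mnistCUDNN output above.
rows = [
    "0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001",
    "0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000",
    "0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006",
]
print([predicted_digit(row) for row in rows])   # [1, 3, 5]
```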

5 Install TensorFlow with native pip

First install Python 3.6 and pip3.

5.1 Install Python 3.6 (optional)

Ubuntu 16.04 ships with Python 2.7.12 and Python 3.5.2 by default.

5.1.1 Add the third-party repository

sudo add-apt-repository ppa:jonathonf/python-3.6

5.1.2 Update the package index and install Python 3.6

sudo apt-get update
sudo apt-get install python3.6
sudo apt-get install python3.6-gdbm

5.1.3 Make Python 3.6 the preferred python3

sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.5 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 2
sudo update-alternatives --config python3

5.1.4 Test

In a terminal, run:

python3 -V

5.2 Install pip3

5.2.1 Install

sudo apt-get install python3-pip python3-dev

5.2.2 Upgrade pip3 (optional)

sudo pip3 install --upgrade pip

If pip3 then reports the following error when installing other packages:
ImportError: cannot import name 'main'

Fix it as follows:

sudo cp -a /usr/bin/pip3{,_backup}
sudo vim /usr/bin/pip3

Change the original:

from pip import main
if __name__ == '__main__':
    sys.exit(main())

to:

from pip import __main__
if __name__ == '__main__':
    sys.exit(__main__._main())

6 Install TensorFlow-GPU

6.1 Install

Using the Aliyun mirror makes the download much faster:
sudo pip3 install --index-url https://mirrors.aliyun.com/pypi/simple tensorflow-gpu

You can also install a specific version:
sudo pip3 install --upgrade tfBinaryURL

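tfBinaryURL is the URL of the TensorFlow Python package. The correct value depends on the operating system, the Python version, and GPU support. The appropriate values are listed in the official installation documentation. For example, to install the CPU-only build of TensorFlow for Linux with Python 3.6, run: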

sudo pip3 install --upgrade https://download.tensorflow.google.cn/linux/cpu/tensorflow-1.8.0-cp36-cp36m-linux_x86_64.whl
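
The dependence on operating system and Python version is visible in the wheel file name at the end of the URL, which follows the pattern name-version-pythontag-abitag-platformtag.whl. The sketch below splits the example URL into those fields; the function name is hypothetical, and it handles only the simple case where the package name contains no hyphen.

```python
def wheel_tags(url):
    """Split a wheel URL's file name into name, version, and compatibility tags."""
    filename = url.rsplit("/", 1)[-1]
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[: -len(".whl")].split("-")
    # Five fields normally; six when an optional build tag follows the version.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel file name: " + filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


url = "https://download.tensorflow.google.cn/linux/cpu/tensorflow-1.8.0-cp36-cp36m-linux_x86_64.whl"
print(wheel_tags(url))
```

Here cp36 means CPython 3.6, which is why the wheel must match the interpreter that pip3 runs under.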

6.2 Uninstall (for reference only; do not run it now)

sudo pip3 uninstall tensorflow-gpu

7 Verify the installation

Run a short TensorFlow program.
Start Python from the shell as follows:

On my machine python points to the default 2.7, and python3 points to Python 3.6:
$ python3

Enter the following short program in the interactive Python shell:

# Python
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))

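If the system prints the following, you are ready to start writing TensorFlow programs (under Python 3 the string is shown as the bytes literal b'Hello, TensorFlow!'):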

Hello, TensorFlow!
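
The hello-world program runs on the CPU-only build too, so it does not prove that the GPU is in use. One further check, sketched below, lists the devices TensorFlow can see and keeps those of type GPU. It assumes the TensorFlow 1.x helper device_lib.list_local_devices(); the two function names are hypothetical, and the import is deferred so the snippet still runs where TensorFlow is absent.

```python
def gpu_names(devices):
    """Keep the names of device records whose device_type is 'GPU'."""
    return [device.name for device in devices if device.device_type == "GPU"]


def list_tensorflow_gpus():
    """Ask TensorFlow 1.x for its local devices and return the GPU names."""
    # Imported lazily so that importing this file does not require TensorFlow.
    from tensorflow.python.client import device_lib
    return gpu_names(device_lib.list_local_devices())


if __name__ == "__main__":
    try:
        print(list_tensorflow_gpus() or "no GPU visible to TensorFlow")
    except ImportError:
        print("TensorFlow is not installed in this interpreter")
```

An empty result on a machine with the driver, CUDA, and cuDNN installed usually points to a version mismatch between tensorflow-gpu and the CUDA or cuDNN libraries.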

References

https://blog.csdn.net/Jonms/article/details/79318566