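Deep learning generally needs a GPU; common consumer cards today include the GeForce GTX 1080 Ti and the GeForce RTX 2080 Ti. Once the hardware is in place, the software environment has to be installed. The two most popular frameworks are TensorFlow (Google, used mainly in engineering) and PyTorch (Facebook, used mainly in academic research). There are plenty of installation tutorials online, but I went through more than a dozen without a successful install, mostly because of network conditions in mainland China. After two days of trial and error and piecing together several blog posts, I finally got everything working. The installation process is described below.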
lspci | grep -i vga
This command lists the graphics adapters. My machine has a GeForce RTX 2080 Ti installed; the output is:
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
3b:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)
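If you want to script this check, a minimal sketch along the following lines could pick the NVIDIA card out of the lspci listing. The function name find_nvidia_gpus is hypothetical, and the sample text is simply the output shown above; on a real machine you would feed it the actual lspci output.

```python
import re

# Sample output copied from this article. On a real machine you could get it
# with: subprocess.check_output(["lspci"]).decode()
SAMPLE_LSPCI = """\
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
3b:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)
"""


def find_nvidia_gpus(lspci_text):
    """Return (pci_address, model) tuples for NVIDIA VGA controllers."""
    gpus = []
    for line in lspci_text.splitlines():
        if "VGA" not in line or "NVIDIA" not in line:
            continue
        address = line.split()[0]
        # lspci prints the marketing name inside square brackets.
        match = re.search(r"\[(.+?)\]", line)
        model = match.group(1) if match else "unknown"
        gpus.append((address, model))
    return gpus


print(find_nvidia_gpus(SAMPLE_LSPCI))
```

The onboard Matrox controller is skipped because only lines mentioning both VGA and NVIDIA are kept.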
cat /etc/redhat-release
The system runs CentOS 7.7; the output is:
CentOS Linux release 7.7.1908 (Core)
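The versions of cudatoolkit, cudnn, and TensorFlow/PyTorch must match one another. Google's official site lists some compatible combinations, but only a small subset, so it is often hard to find your own versions there. For that reason we use conda to install cudatoolkit, cudnn, and TensorFlow/PyTorch. cudatoolkit and cudnn do not need to be installed separately: installing TensorFlow or PyTorch directly pulls in the matching cudatoolkit and cudnn automatically. The installation order is: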
GPU driver -> Anaconda -> cudatoolkit/cudnn/TensorFlow
GPU driver -> Anaconda -> cudatoolkit/cudnn/PyTorch
You can install into the default (base) environment or create a dedicated virtual environment.
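The available driver versions are listed on the download page. The driver is normally downloaded from the official NVIDIA site; if that is too slow, a copy obtained elsewhere works just as well. The download link for the GeForce RTX 2080 Ti driver is: GeForce RTX 2080 Ti
Download: NVIDIA official download page
For the installation itself, see the driver installation section of this thorough guide: Centos7 安装独立显卡驱动 (Installing a discrete GPU driver on CentOS 7)
Check whether the installation succeeded:
watch -n 1 nvidia-smi  # -n 1 refreshes the display every second
Output like the following means the driver is installed correctly: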
Every 1.0s: nvidia-smi Sun Mar 1 11:45:55 2020
Sun Mar 1 11:45:56 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 36%   36C    P0    18W / 260W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
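The "CUDA Version" field in this header is the highest CUDA runtime version the installed driver supports, so it is worth comparing it with the cudatoolkit version conda is about to install. The sketch below shows one way to do that; the helper names (parse_smi_header, driver_supports) are hypothetical, and the sample line is the header from the output above.

```python
import re

# Header line taken from the nvidia-smi output shown in this article.
SAMPLE_HEADER = "| NVIDIA-SMI 440.59 Driver Version: 440.59 CUDA Version: 10.2 |"


def parse_smi_header(text):
    """Extract the driver version and the highest CUDA version it supports."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    if not driver or not cuda:
        return None
    return {"driver": driver.group(1), "cuda": cuda.group(1)}


def version_tuple(version):
    """Turn '10.2' into (10, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(header, toolkit_version):
    """True if the driver's maximum CUDA version covers the given toolkit."""
    return version_tuple(toolkit_version) <= version_tuple(header["cuda"])


info = parse_smi_header(SAMPLE_HEADER)
print(info)
print("cudatoolkit 9.2 supported:", driver_supports(info, "9.2"))
```

Comparing tuples rather than strings matters here: as plain strings, "9.2" would sort after "10.2".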
Download: Anaconda3-4.3.0-Linux-x86_64.sh
Install:
bash Anaconda3-4.3.0-Linux-x86_64.sh
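Configure a mirror inside China (because of network conditions, later installs will almost certainly fail without one).
The Tsinghua mirror is recommended. Many configuration recipes circulate online, but most of them have problems; the mirror's own instructions are the most complete: Anaconda 镜像使用帮助 (Anaconda mirror usage help)
Set conda's download timeout. Because of the network it is best to choose a generous value; I set it to 2 hours: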
conda config --set remote_read_timeout_secs 7200
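For reference, the mirror and timeout settings both end up in ~/.condarc. The sketch below only renders such a file as text so you can inspect it before writing anything. The channel URLs are an assumption based on the layout described on the Tsinghua mirror help page and should be checked there; whether every key is honored depends on your conda version. render_condarc is a hypothetical helper.

```python
# Assumption: base URL and sub-paths follow the Tsinghua (TUNA) mirror help
# page at the time of writing; verify them before using the result.
MIRROR = "https://mirrors.tuna.tsinghua.edu.cn/anaconda"


def render_condarc(mirror=MIRROR, timeout_secs=7200):
    """Build the text of a .condarc that uses the mirror and a long timeout."""
    lines = [
        "channels:",
        "  - defaults",
        "show_channel_urls: true",
        "default_channels:",
        f"  - {mirror}/pkgs/main",
        f"  - {mirror}/pkgs/r",
        "custom_channels:",
        f"  pytorch: {mirror}/cloud",
        # Same setting as: conda config --set remote_read_timeout_secs 7200
        f"remote_read_timeout_secs: {timeout_secs}",
    ]
    return "\n".join(lines) + "\n"


print(render_condarc())
```

After comparing the printed text with the mirror's current instructions, you could save it as ~/.condarc.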
Verify the installation: run the python command. Output like the following means Anaconda was installed successfully:
Python 3.6.0 |Anaconda 4.3.0 (64-bit)| (default, Dec 23 2016, 12:22:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Install TensorFlow with the command below. If no version is specified, the latest version is installed by default.
conda install tensorflow-gpu==1.12.0
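If the following error appears, conda's download timeout is too short; raise it again as in step 4 of the Anaconda installation (conda config --set remote_read_timeout_secs):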
CondaError: Downloaded bytes did not match Content-Length
url: https://repo.anaconda.com/pkgs/main/linux-64/cudnn-7.0.5-cuda8.0_0.tar.bz2
target_path: /home/yyf/miniconda3/pkgs/cudnn-7.0.5-cuda8.0_0.tar.bz2
Content-Length: 261398285
downloaded bytes: 47463195
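To see how far the download got before it was cut off, the fields of this error message can be read programmatically. This is only an illustrative sketch; summarize_truncated_download is a hypothetical helper and the sample text is the error shown above.

```python
# Error text copied from this article.
SAMPLE_ERROR = """\
CondaError: Downloaded bytes did not match Content-Length
url: https://repo.anaconda.com/pkgs/main/linux-64/cudnn-7.0.5-cuda8.0_0.tar.bz2
target_path: /home/yyf/miniconda3/pkgs/cudnn-7.0.5-cuda8.0_0.tar.bz2
Content-Length: 261398285
downloaded bytes: 47463195
"""


def summarize_truncated_download(error_text):
    """Report which package was truncated and how much of it arrived."""
    fields = {}
    for line in error_text.splitlines():
        # Split on the first ": " only, so the URL stays intact.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    expected = int(fields["Content-Length"])
    received = int(fields["downloaded bytes"])
    return {
        "package": fields["url"].rsplit("/", 1)[-1],
        "expected_mib": round(expected / 2**20, 1),
        "received_pct": round(100 * received / expected, 1),
    }


print(summarize_truncated_download(SAMPLE_ERROR))
```

In this case less than a fifth of a roughly 250 MiB package had arrived, which fits a timeout on a slow connection rather than a corrupted file.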
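If the problem persists, find the required packages in the install log, then download them from the Tsinghua mirror by hand and install them offline: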
The following NEW packages will be INSTALLED:
blas pkgs/main/linux-64::blas-1.0-mkl
cudatoolkit pkgs/main/linux-64::cudatoolkit-9.2-0
freetype pkgs/main/linux-64::freetype-2.9.1-h8a8886c_1
intel-openmp pkgs/main/linux-64::intel-openmp-2020.0-166
jpeg pkgs/main/linux-64::jpeg-9b-h024ee3a_2
libgfortran-ng pkgs/main/linux-64::libgfortran-ng-7.3.0-hdf63c60_0
libpng pkgs/main/linux-64::libpng-1.6.37-hbc83047_0
libtiff pkgs/main/linux-64::libtiff-4.1.0-h2733197_0
mkl pkgs/main/linux-64::mkl-2020.0-166
mkl-service pkgs/main/linux-64::mkl-service-2.3.0-py36he904b0f_0
mkl_fft pkgs/main/linux-64::mkl_fft-1.0.15-py36ha843d7b_0
mkl_random pkgs/main/linux-64::mkl_random-1.1.0-py36hd6b4f25_0
ninja pkgs/main/linux-64::ninja-1.9.0-py36hfd86e86_0
numpy pkgs/main/linux-64::numpy-1.18.1-py36h4f9e942_0
numpy-base pkgs/main/linux-64::numpy-base-1.18.1-py36hde5b4d6_1
olefile pkgs/main/linux-64::olefile-0.46-py36_0
pillow pkgs/main/linux-64::pillow-7.0.0-py36hb39fc2d_0
pytorch pytorch/linux-64::pytorch-1.4.0-py3.6_cuda9.2.148_cudnn7.6.3_0
six pkgs/main/linux-64::six-1.14.0-py36_0
torchvision pytorch/linux-64::torchvision-0.5.0-py36_cu92
zstd pkgs/main/linux-64::zstd-1.3.7-h0b5b093_0
conda install --offline ./cudatoolkit-9.2-0.tar.bz2
conda install --offline ./cudnn-7.6.5-cuda9.2_0.tar.bz2
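The entries in the install plan can be turned into download URLs mechanically. The sketch below assumes the Tsinghua mirror keeps default-channel packages under /anaconda/pkgs/... and third-party channels such as pytorch under /anaconda/cloud/..., and that packages use the .tar.bz2 extension as in the commands above. Both are assumptions to verify against the mirror; mirror_url is a hypothetical helper.

```python
# Assumption: base URL of the Tsinghua (TUNA) Anaconda mirror.
MIRROR = "https://mirrors.tuna.tsinghua.edu.cn/anaconda"


def mirror_url(plan_line, mirror=MIRROR):
    """Map one 'name channel/subdir::filename' plan line to a mirror URL."""
    _name, spec = plan_line.split()
    location, filename = spec.split("::")
    channel, subdir = location.rsplit("/", 1)
    # Assumption: default channels live under <mirror>/pkgs/..., while
    # third-party channels (for example pytorch) live under <mirror>/cloud/...
    prefix = channel if channel.startswith("pkgs/") else f"cloud/{channel}"
    return f"{mirror}/{prefix}/{subdir}/{filename}.tar.bz2"


for line in (
    "cudatoolkit pkgs/main/linux-64::cudatoolkit-9.2-0",
    "pytorch pytorch/linux-64::pytorch-1.4.0-py3.6_cuda9.2.148_cudnn7.6.3_0",
):
    print(mirror_url(line))
```

Each printed URL can then be fetched with a download tool of your choice and passed to conda install --offline as shown above.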
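To verify the TensorFlow installation, run the following in Python:
from tensorflow.python.client import device_lib
print([device.device_type for device in device_lib.list_local_devices()])
Output like the following means the installation succeeded: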
2020-03-01 11:49:12.188393: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2020-03-01 11:49:13.759115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:3b:00.0
totalMemory: 10.76GiB freeMemory: 10.60GiB
2020-03-01 11:49:13.759191: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2020-03-01 11:49:14.102316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-01 11:49:14.102378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2020-03-01 11:49:14.102390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2020-03-01 11:49:14.102494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 10232 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:3b:00.0, compute capability: 7.5)
['CPU', 'XLA_CPU', 'GPU', 'XLA_GPU']
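The entry to look for in that final list is 'GPU'. If you prefer an explicit pass/fail message over reading the log, a small wrapper like the sketch below could be used. The helper names are hypothetical, and the TensorFlow import is guarded so the script still runs in an environment where TensorFlow is missing.

```python
def gpu_visible(device_types):
    """True if a plain 'GPU' device appears among the reported device types."""
    return "GPU" in device_types


def tensorflow_device_types():
    """Return the device types TensorFlow sees, or None if it is not installed."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [d.device_type for d in device_lib.list_local_devices()]


if __name__ == "__main__":
    types = tensorflow_device_types()
    if types is None:
        print("TensorFlow is not installed in this environment")
    elif gpu_visible(types):
        print("GPU is visible to TensorFlow")
    else:
        print("TensorFlow only sees:", types)
```

A list that contains only 'CPU' and 'XLA_CPU' usually means the CPU-only build was installed or the driver is not being picked up.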
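TensorFlow training code commonly depends on Keras, so a training run will often fail with an error saying the keras module does not exist. Install it with:
pip install keras==2.x.y  # replace 2.x.y with the Keras release that matches your TensorFlow version
Note: the TensorFlow and Keras versions must match, otherwise Keras cannot be used. For the mapping, see: tensorflow与keras的版本对应关系 (TensorFlow/Keras version compatibility table).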
To verify the PyTorch installation, run the following in Python:
import torch
a = torch.cuda.is_available()
print(a)
ngpu = 1
# Decide which device we want to run on
device = torch.device("cuda:0" if (torch.cuda.is_available() and ngpu > 0) else "cpu")
print(device)
print(torch.cuda.get_device_name(0))
print(torch.rand(3,3).cuda())
Output like the following means the installation succeeded:
True
cuda:0
GeForce GTX 1080
tensor([[0.9530, 0.4746, 0.9819],
        [0.7192, 0.9427, 0.6768],
        [0.8594, 0.9490, 0.6551]], device='cuda:0')
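Beyond printing a tensor, it can be reassuring to run a tiny computation on the selected device. The following is a minimal sketch with hypothetical helper names; it falls back to the CPU when CUDA is unavailable and returns None when PyTorch is not installed at all.

```python
def choose_device_name(cuda_available, ngpu=1):
    """Mirror the device selection used in the check above."""
    return "cuda:0" if (cuda_available and ngpu > 0) else "cpu"


def run_smoke_test():
    """Multiply a random matrix by the identity on the chosen device."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed in this environment
    device = torch.device(choose_device_name(torch.cuda.is_available()))
    a = torch.rand(3, 3, device=device)
    product = a @ torch.eye(3, device=device)
    # Multiplying by the identity must give back the same matrix.
    return bool(torch.allclose(product, a))


if __name__ == "__main__":
    print("smoke test result:", run_smoke_test())
```

A result of True on a machine where torch.cuda.is_available() is also True indicates that kernels actually execute on the GPU, not just that the card is detected.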