No. | Change | Date |
---|---|---|
1 | Initial version | 20210708 |
2 | Added TensorFlow support | 20210715 |

This article walks through installing CUDA 10.2 on a CentOS 7.6 KVM virtual machine.
[root@localhost ~]# cat /etc/centos-release
CentOS Linux release 7.6.1810 (Core)
[root@localhost ~]#
[root@localhost ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 4
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 15
Model: 6
Model name: Common KVM processor
Stepping: 1
CPU MHz: 2194.842
BogoMIPS: 4389.68
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology eagerfpu pni cx16 x2apic hypervisor lahf_lm
[root@localhost ~]# systemd-detect-virt
kvm
[root@localhost ~]#
Installer package: cuda_10.2.89_440.33.01_linux.run
[root@localhost tmp]# ll
total 2583884
-rw-r--r-- 1 root root 2645419389 Jul 8 11:04 cuda_10.2.89_440.33.01_linux.run
-rwx------. 1 root root 836 Jul 8 10:45 ks-script-1Zymsq
-rw-------. 1 root root 0 Jul 8 10:41 yum.log
-rwxr--r--. 1 root root 468336 Jul 8 10:55 zabbix-agent-5.0.11-1.el7.x86_64.rpm
[root@localhost tmp]#
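The pre-installation checks cover three things: whether the machine has a GPU card, the system runlevel, and whether the nouveau module is disabled.
Install pciutils to get the lspci command; skip this step if the command is already available.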
[root@localhost tmp]# yum install -y pciutils
The check shows that this machine does have an NVIDIA card:
[root@localhost tmp]# lspci -nnk | grep NVI
00:10.0 3D controller [0302]: NVIDIA Corporation Device [10de:1eb8] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12a2]
[root@localhost tmp]# lspci | grep -i nvidia
00:10.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
[root@localhost tmp]#
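The manual lspci check can also be scripted. The following is a minimal sketch, not part of the original procedure: the function name has_nvidia_gpu is made up for illustration, and it assumes that `lspci -nn` prints the NVIDIA PCI vendor ID 10de in square brackets, as it does in the output above.

```shell
# has_nvidia_gpu is a hypothetical helper, not part of pciutils.
# It reads `lspci -nn` style text on stdin and succeeds if any line
# carries the NVIDIA PCI vendor ID (10de) in the [vendor:device] form.
has_nvidia_gpu() {
    grep -qi '\[10de:[0-9a-f]\{4\}\]'
}

# Only run the real check when lspci is available.
if command -v lspci >/dev/null 2>&1; then
    if lspci -nn | has_nvidia_gpu; then
        echo "NVIDIA device found"
    else
        echo "no NVIDIA device found"
    fi
fi
```

Matching on the vendor ID rather than the string "NVIDIA" keeps the check independent of how the local PCI ID database names the device.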
[root@localhost tmp]# yum install gcc gcc-c++
First check which kernel packages are already installed:
[root@localhost tmp]# rpm -qa | grep kernel
kernel-tools-3.10.0-957.el7.x86_64
kernel-headers-3.10.0-957.el7.x86_64
kernel-tools-libs-3.10.0-957.el7.x86_64
kernel-3.10.0-957.el7.x86_64
[root@localhost tmp]#
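As shown, kernel-headers-3.10.0-957.el7.x86_64 is already installed, so only kernel-devel-3.10.0-957.el7.x86_64 is missing.
The version string 3.10.0-957.el7.x86_64 must match your own running kernel (see the output of uname -r).
Install it: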
[root@localhost tmp]# yum install kernel-devel-3.10.0-957.el7.x86_64
Check again after the installation:
[root@localhost tmp]# rpm -qa | grep kernel
kernel-tools-3.10.0-957.el7.x86_64
kernel-headers-3.10.0-957.el7.x86_64
kernel-devel-3.10.0-957.el7.x86_64
kernel-tools-libs-3.10.0-957.el7.x86_64
kernel-3.10.0-957.el7.x86_64
[root@localhost tmp]#
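Because the driver build depends on kernel-devel matching the running kernel exactly, the comparison is worth automating. This is a rough sketch under the assumption that `rpm -qa` lists the package as kernel-devel-<release>, where <release> is what `uname -r` prints (true for the package names shown above). The helper name kernel_devel_matches is hypothetical.

```shell
# kernel_devel_matches is a hypothetical helper.
# $1:    running kernel release (e.g. the output of `uname -r`)
# stdin: installed package names, one per line (e.g. from `rpm -qa`)
# Succeeds only if a kernel-devel package for exactly that release exists.
kernel_devel_matches() {
    grep -qxF "kernel-devel-$1"
}

# Only run the real check on systems that have rpm.
if command -v rpm >/dev/null 2>&1; then
    running="$(uname -r)"
    if rpm -qa | kernel_devel_matches "$running"; then
        echo "kernel-devel matches running kernel $running"
    else
        echo "kernel-devel-$running is not installed"
    fi
fi
```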
Next, check whether the nouveau module is loaded; if it is, disable it before continuing.
[root@localhost tmp]# lsmod | grep nouveau
nouveau 1869689 0
mxm_wmi 13021 1 nouveau
wmi 21636 2 mxm_wmi,nouveau
video 24538 1 nouveau
i2c_algo_bit 13413 1 nouveau
drm_kms_helper 179394 2 bochs_drm,nouveau
ttm 114635 2 bochs_drm,nouveau
drm 429744 5 ttm,bochs_drm,drm_kms_helper,nouveau
[root@localhost tmp]#
It is loaded on this machine, so it has to be disabled.
Create /usr/lib/modprobe.d/blacklist-nouveau.conf
with the following content:
blacklist nouveau
options nouveau modeset=0
[root@localhost tmp]# cd /usr/lib/modprobe.d/
[root@localhost modprobe.d]# ll
total 8
-rw-r--r--. 1 root root 382 Oct 30 2018 dist-alsa.conf
-rw-r--r--. 1 root root 982 Oct 31 2018 dist-blacklist.conf
[root@localhost modprobe.d]# vi blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
Rebuild the initramfs so that the blacklist takes effect at the next boot:
[root@localhost modprobe.d]# dracut --force
[root@localhost modprobe.d]#
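The blacklist file can also be written without opening an editor. The sketch below is one possible way to do it; the function name write_nouveau_blacklist is hypothetical, and the target directory is passed as an argument so the function can be tried against a scratch directory first.

```shell
# write_nouveau_blacklist is a hypothetical helper.
# $1: directory to write into (the article uses /usr/lib/modprobe.d)
# Writes the same two lines that the article enters with vi.
write_nouveau_blacklist() {
    cat > "$1/blacklist-nouveau.conf" <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
}

# Typical use as root, followed by the initramfs rebuild:
#   write_nouveau_blacklist /usr/lib/modprobe.d
#   dracut --force
```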
The GPU driver must be installed in text mode (runlevel 3).
Check the current runlevel:
[root@localhost modprobe.d]# runlevel
N 3
[root@localhost modprobe.d]#
This machine is already at runlevel 3, so nothing needs to change.
If it were not, the usual command to set it is:
systemctl set-default multi-user.target
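On systemd-based systems the default target can be read with `systemctl get-default`, so the decision can be scripted. The following is a small sketch; text_mode_command is a made-up name, and it only prints the command that would be needed instead of running it. Note that set-default takes effect at the next boot.

```shell
# text_mode_command is a hypothetical helper.
# $1: current default target (e.g. the output of `systemctl get-default`)
# Prints the command needed to switch to text mode, or nothing if the
# default target is already multi-user.target.
text_mode_command() {
    if [ "$1" != "multi-user.target" ]; then
        echo "systemctl set-default multi-user.target"
    fi
}

# Only query systemd when systemctl is available.
if command -v systemctl >/dev/null 2>&1; then
    text_mode_command "$(systemctl get-default 2>/dev/null)"
fi
```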
Reboot the machine. After the reboot, nouveau is no longer loaded:
[root@localhost modprobe.d]# reboot
Connection to 10.3.144.25 closed by remote host.
Connection to 10.3.144.25 closed.
[dev@10-3-170-32 base]$ ssh root@10.3.144.25
Last login: Thu Jul 8 11:10:56 2021 from 10.3.170.32
[root@localhost ~]# lsmod | grep nouveau
[root@localhost ~]#
Run the installation as root.
Make the installer executable:
[root@localhost tmp]# ll cuda_10.2.89_440.33.01_linux.run
-rw-r--r-- 1 root root 2645419389 Jul 8 11:04 cuda_10.2.89_440.33.01_linux.run
[root@localhost tmp]# chmod u+x cuda_10.2.89_440.33.01_linux.run
[root@localhost tmp]#
Run the installer:
[root@localhost tmp]# ./cuda_10.2.89_440.33.01_linux.run --no-opengl-libs
===========
= Summary =
===========
Driver: Installed
Toolkit: Installed in /usr/local/cuda-10.2/
Samples: Installed in /root/, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-10.2/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-10.2/lib64, or, add /usr/local/cuda-10.2/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.2/doc/pdf for detailed information on setting up CUDA.
Logfile is /var/log/cuda-installer.log
[root@localhost tmp]#
Configure the CUDA environment variables in /etc/profile:
[root@localhost tmp]# cp /etc/profile /etc/profile.bak.20210708
[root@localhost tmp]# vi /etc/profile
[root@localhost tmp]#
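Append the following to the end of the file (/usr/local/cuda is the symlink that points to /usr/local/cuda-10.2):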
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
[root@localhost tmp]# source /etc/profile
[root@localhost tmp]# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
[root@localhost tmp]#
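The two export lines in /etc/profile work, but they grow PATH each time the file is sourced, and when LD_LIBRARY_PATH starts out empty they leave a trailing colon, which the dynamic loader treats as the current directory. A slightly more defensive variant is sketched below. The helper name prepend_once is hypothetical, and placing the snippet in a file such as /etc/profile.d/cuda.sh is an assumption, not something the original setup does.

```shell
# prepend_once is a hypothetical helper.
# $1: directory to add, $2: current colon-separated list
# Prints the list with $1 in front, without adding it twice and without
# leaving an empty entry when the list was empty.
prepend_once() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *) printf '%s\n' "$1${2:+:$2}" ;;
    esac
}

PATH="$(prepend_once /usr/local/cuda/bin "$PATH")"
LD_LIBRARY_PATH="$(prepend_once /usr/local/cuda/lib64 "${LD_LIBRARY_PATH:-}")"
export PATH LD_LIBRARY_PATH
```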
[root@localhost tmp]# nvidia-smi
Thu Jul 8 14:40:58 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:10.0 Off | 0 |
| N/A 63C P0 26W / 70W | 0MiB / 15109MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[root@localhost tmp]#
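For scripted checks, nvidia-smi can print single fields with `--query-gpu` and `--format=csv,noheader`. The sketch below compares the reported driver version against a minimum. The helper name version_ge is hypothetical, it relies on `sort -V` from GNU coreutils, and the minimum of 440.33 for CUDA 10.2 is an assumption that should be confirmed against the CUDA release notes.

```shell
# version_ge is a hypothetical helper: succeeds if version $1 >= version $2.
# It relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed minimum Linux driver for CUDA 10.2; verify in the release notes.
min_driver="440.33"

# Only query the driver when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if version_ge "$driver" "$min_driver"; then
        echo "driver $driver is new enough for CUDA 10.2"
    else
        echo "driver $driver is older than $min_driver"
    fi
fi
```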
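VMs created in Proxmox usually use the default CPU type, kvm64. That CPU model does not expose the AVX and AVX2 instruction sets, so TensorFlow does not run on it.
My workaround was as follows.
Check the host CPU: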
[root@localhost tmp]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
stepping : 7
microcode : 0x1
cpu MHz : 2194.842
cache size : 16384 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat umip pku ospke avx512_vnni spec_ctrl intel_stibp arch_capabilities
bogomips : 4389.68
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
The output shows that the host CPU does support these instruction sets.
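Reading the long flags line by eye is error-prone, so the check can be reduced to a yes/no answer per flag. This is a minimal sketch; has_cpu_flag is a made-up helper name, and it assumes the x86 layout of /proc/cpuinfo, where the flags appear on a line starting with "flags". The same check can be run inside the guest to confirm what the VM actually sees.

```shell
# has_cpu_flag is a hypothetical helper.
# $1: flag name (e.g. avx2)
# $2: cpuinfo file to read (defaults to /proc/cpuinfo)
# Splits the first "flags" line into one token per line and looks for
# an exact match, so "avx" does not match "avx2" or "avx512f".
has_cpu_flag() {
    grep -m 1 '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null | tr ' \t' '\n\n' | grep -qx "$1"
}

for flag in avx avx2; do
    if has_cpu_flag "$flag"; then
        echo "$flag: present"
    else
        echo "$flag: missing"
    fi
done
```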
Create a new VM and set its CPU type to host.
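The CPU type is normally chosen in the Proxmox web interface when the VM is created. For reference, the same setting can be applied from the Proxmox host shell with the `qm` CLI. The wrapper below is only a sketch: set_cpu_host and the DRY_RUN variable are hypothetical names, the VM ID 101 is a placeholder, and it is meant for a freshly created VM that has not been started yet.

```shell
# set_cpu_host is a hypothetical wrapper around Proxmox's `qm` CLI.
# $1: numeric VM ID
# With DRY_RUN=1 it only prints the command instead of running it.
set_cpu_host() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: set_cpu_host <vmid>" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "qm set $1 --cpu host"
    else
        qm set "$1" --cpu host
    fi
}

# Example (placeholder VM ID), printing the command only:
#   DRY_RUN=1 set_cpu_host 101
```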
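Then follow the steps in Chapter 3.
At first I edited the CPU of the existing VM and simply changed its type to host, but the VM kept reporting errors after it was restarted. Nothing I tried fixed it, and the problem had a knock-on effect: even a newly created VM could not install CUDA.
I then rebooted the host, recreated the VM as described in 4.1.1, and the CUDA installation succeeded.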
Supported
Supported
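To round things off, it is worth knowing how to uninstall CUDA as well.
Uninstall the GPU driver:
# /usr/bin/nvidia-uninstall
Uninstall CUDA (X.Y is the CUDA version number):
# /usr/local/cuda-X.Y/bin/cuda-uninstaller
Or, for older versions:
# /usr/local/cuda-X.Y/bin/uninstall_cuda_X.Y.pl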
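Since the uninstaller name differs between newer and older releases, a script can check which of the two exists before running anything. This is a sketch under the assumption that the toolkit lives under <prefix>/cuda-X.Y as in the paths above. The helper name find_cuda_uninstaller is hypothetical, and it only prints the path rather than executing it.

```shell
# find_cuda_uninstaller is a hypothetical helper.
# $1: CUDA version in X.Y form (e.g. 10.2)
# $2: install root (defaults to /usr/local)
# Prints the path of whichever uninstaller exists and is executable;
# returns 1 if neither is found.
find_cuda_uninstaller() {
    bin="${2:-/usr/local}/cuda-$1/bin"
    for candidate in "$bin/cuda-uninstaller" "$bin/uninstall_cuda_$1.pl"; do
        if [ -x "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Example: find_cuda_uninstaller 10.2
```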