System: Red Hat Enterprise Linux 7 (also applies to CentOS 7)
NVIDIA driver version: 410.129 (CUDA Toolkit: 10.0)
cuda 10.0
cudnn 7.6.5
python 3.7
tensorflow-gpu 1.14.0
Step 1: Install Anaconda
bash Anaconda3-2020.02-Linux-x86_64.sh
vim /etc/profile
Add:
#Anaconda
export PATH=/root/anaconda3/bin:$PATH
source /etc/profile
Check the Python version
python --version
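If python --version still reports the system Python, the PATH entry probably did not take effect. A minimal sketch for checking it, assuming the /root/anaconda3/bin location used in this guide; the helper name path_contains is made up for illustration.

```shell
# Hypothetical helper: succeed if directory $1 is an entry of the
# colon-separated list $2 (for example the value of PATH).
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# /root/anaconda3/bin is the install location assumed in this guide.
if path_contains /root/anaconda3/bin "$PATH"; then
    echo "Anaconda is on PATH"
else
    echo "Anaconda is not on PATH; re-check /etc/profile"
fi
```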
Step 2: Install the NVIDIA driver
1. Disable the open-source graphics driver (nouveau) that ships with Linux
vim /etc/modprobe.d/blacklist.conf
Append the following lines to the end of the file:
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist rivatv
blacklist nvidiafb
chmod 644 /etc/modprobe.d/blacklist.conf
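Then reboot the server: reboot
After rebooting, run: lsmod | grep nouveau (no output means nouveau is disabled)
2. Install dependencies
Installing the driver requires several build dependencies, such as gcc, g++, and make
yum -y install gcc gcc-c++ kernel-devel kernel-headers make
Go to https://www.nvidia.cn/Download/index.aspx?lang=cn and find the driver that matches your system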
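The lsmod check can be scripted so it gives a clear yes/no answer. This is a sketch only; the helper name nouveau_disabled is hypothetical, and it simply looks for a line beginning with the module name in lsmod-style output.

```shell
# Hypothetical helper: reads lsmod-style text on stdin and succeeds only
# when no line starts with the module name "nouveau".
nouveau_disabled() {
    ! grep '^nouveau[[:space:]]' >/dev/null
}

# Run the check against the live module list when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | nouveau_disabled; then
        echo "nouveau is not loaded"
    else
        echo "nouveau is still loaded; re-check the blacklist and reboot"
    fi
fi
```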
Tesla Driver for Linux RHEL 7
Version: 410.129
Release date: 2019.9.4
Operating system: Linux 64-bit RHEL7
CUDA Toolkit: 10.0
Language: Chinese (Simplified)
File size: 111.5 MB
3. Install the driver (CUDA Toolkit: 10.0)
rpm -ivh nvidia-diag-driver-local-repo-rhel7-410.129-1.0-1.x86_64.rpm
yum install -y cuda-drivers
Run the following command to check the installed driver packages and their version.
rpm -qa | grep -i nvidia
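Driver versions such as 410.129 and 410.48 have to be compared field by field as numbers, since a plain string comparison would rank 410.48 above 410.129. The sketch below does that with awk. The helper name version_ge is hypothetical, and treating 410.48 (the driver bundled with the CUDA 10.0 runfile used later in this guide) as the minimum is an assumption.

```shell
# Hypothetical helper: succeed if dotted version $1 >= dotted version $2,
# comparing each dot-separated field numerically.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Query the running driver only when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$installed" 410.48; then
        echo "driver $installed is new enough"
    else
        echo "driver $installed is older than 410.48"
    fi
fi
```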
Step 3: Install CUDA 10.0
1. If you downloaded the runfile (.run) installer:
sudo sh cuda_10.0.130_410.48_linux.run
Note: when asked whether to install the GPU driver, choose no, because the driver was already installed in Step 2
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
(y)es/(n)o/(q)uit: n
!!!! To uninstall a runfile installation (needed only before reinstalling):
!!!! cd /usr/local/cuda/bin
!!!! sudo ./uninstall_cuda_10.0.pl
reboot
Check the driver and GPU status:
nvidia-smi
2. Configure environment variables
vim ~/.bashrc
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
source ~/.bashrc
vim /etc/profile
export PATH=/usr/local/cuda/bin:$PATH
export CUDA_INSTALL_PATH=/usr/local/cuda
export LD_LIBRARY_PATH=$CUDA_INSTALL_PATH/lib64:$LD_LIBRARY_PATH
source /etc/profile
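Editing the same files by hand on every reinstall tends to leave duplicate export lines behind. A small sketch of an idempotent alternative follows; the helper name append_once is hypothetical, and the example calls are commented out because they would modify real files.

```shell
# Hypothetical helper: append line $1 to file $2 unless an identical
# line is already present, so repeated runs do not pile up duplicates.
append_once() {
    if ! grep -qxF -- "$1" "$2" 2>/dev/null; then
        printf '%s\n' "$1" >> "$2"
    fi
}

# Example (commented out because it edits a real file):
# append_once 'export PATH=/usr/local/cuda/bin:$PATH' ~/.bashrc
# append_once 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' ~/.bashrc
```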
3. Create the dynamic linker configuration file:
vim /etc/ld.so.conf.d/cuda.conf
Add the following line:
/usr/local/cuda/lib64
Then run
sudo ldconfig
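To confirm that ldconfig picked up the CUDA libraries, the cache listing from ldconfig -p can be filtered by library name. This is a sketch; the helper name cached_lib_path is hypothetical, and libcudart is used only as a representative CUDA library.

```shell
# Hypothetical helper: reads `ldconfig -p` output on stdin and prints the
# resolved path of every cached library whose name starts with $1.
cached_lib_path() {
    awk -v name="$1" 'index($1, name) == 1 { print $NF }'
}

# List the cached CUDA runtime libraries when ldconfig is on PATH.
if command -v ldconfig >/dev/null 2>&1; then
    ldconfig -p | cached_lib_path libcudart
fi
```

An empty result suggests that /etc/ld.so.conf.d/cuda.conf was not read or that ldconfig has not been rerun.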
Step 4: Extract and install cuDNN 7.6.5 for CUDA 10.0
tar -zxvf cudnn-10.0-linux-x64-v7.6.5.326.tgz
cp /data/software/cuda/include/cudnn.h /usr/local/cuda/include
cp /data/software/cuda/lib64/libcudnn* /usr/local/cuda/lib64
chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
Update the symlinks
cd /usr/local/cuda/lib64
sudo ln -sf libcudnn.so.7.6.5 libcudnn.so.7
sudo ln -sf libcudnn.so.7 libcudnn.so
sudo ldconfig
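cuDNN 7.x records its version in cudnn.h through the macros CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL, so the copied header can be checked directly. A minimal sketch, with a hypothetical helper name cudnn_version and the header path used in this guide:

```shell
# Hypothetical helper: print MAJOR.MINOR.PATCHLEVEL as defined in the
# cuDNN header file given as $1.
cudnn_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { print major "." minor "." patch }' "$1"
}

# Check the header copied in this step, if it exists.
if [ -f /usr/local/cuda/include/cudnn.h ]; then
    cudnn_version /usr/local/cuda/include/cudnn.h
fi
```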
Check the nvcc version information to verify that CUDA was installed successfully.
nvcc -V
nvcc --version
Uninstall
rpm -qa | grep -i nvidia
yum remove nvidia-driver-local-repo-rhel7-440.64.00-1.0-1.x86_64
rpm -qa | grep -i cuda
yum remove cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64
yum clean all
Error when running the sample code
ImportError: /lib64/libm.so.6: version `GLIBC_2.23' not found (required by /root/anaconda3/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)
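Before building anything, it helps to compare the glibc version the error asks for with the one the system provides. The sketch below extracts the required version from an error message and prints the system version via getconf. The helper name required_glibc is hypothetical.

```shell
# Hypothetical helper: reads an error message on stdin and prints the
# GLIBC version it asks for (the digits after "GLIBC_").
required_glibc() {
    sed -n 's/.*GLIBC_\([0-9][0-9.]*\).*/\1/p'
}

# Print the glibc version of the running system for comparison.
# On glibc systems getconf reports it in the form "glibc X.Y".
getconf GNU_LIBC_VERSION 2>/dev/null || echo "glibc version not reported by getconf"
```

Usage: pipe the saved error text into required_glibc, then compare the result with the getconf output.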
1. Download glibc
Download the source code from http://www.gnu.org/software/libc/
2. Install
tar -zxvf glibc-2.23.tar.gz
cd glibc-2.23
mkdir build
cd build
echo $LD_LIBRARY_PATH
../configure --prefix=/data/glibc-2.23 (../configure --prefix=/usr --disable-profile --enable-kernel=2.6.32 --enable-obsolete-rpc)
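Error: LD_LIBRARY_PATH shouldn't contain the current directory
Fix: temporarily clear LD_LIBRARY_PATH (vim ~/.bashrc, or export it in the current shell).
The value printed above was /usr/local/cuda-10.0/lib64: and the empty entry after the trailing colon is treated as the current directory.
export LD_LIBRARY_PATH=
echo $LD_LIBRARY_PATH
Remember to restore the original value after compiling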
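Instead of clearing the variable entirely, the empty and "." entries can be stripped out, since those are the ones that refer to the current directory. A sketch under that assumption follows; the helper name clean_path_list is hypothetical, and the configure call is left commented out.

```shell
# Hypothetical helper: print the colon-separated list $1 with empty
# entries and "." entries removed.
clean_path_list() {
    printf '%s' "$1" | tr ':' '\n' |
        awk '$0 != "" && $0 != "." { out = out sep $0; sep = ":" }
             END { print out }'
}

# Demonstration with the value seen in this guide (trailing colon).
clean_path_list "/usr/local/cuda-10.0/lib64:"

# Example: apply the cleaned value to one configure run only, leaving
# ~/.bashrc untouched (commented out; run it from the build directory).
# LD_LIBRARY_PATH=$(clean_path_list "${LD_LIBRARY_PATH:-}") ../configure --prefix=/data/glibc-2.23
```

Setting the variable on the command line this way affects only that one command, so nothing has to be restored afterwards.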
make -j4
make install
make install then fails with another error
gawk: error while loading shared libraries: /lib64/libm.so.6: invalid ELF header
make[2]: *** [/data/software/glibc-2.23/build/math/stubs] Error 127
make[2]: Leaving directory `/data/software/glibc-2.23/math'
make[1]: *** [math/subdir_install] Error 2
make[1]: Leaving directory `/data/software/glibc-2.23'
make: *** [install] Error 2
Fix
cd /lib64
ls -l | grep libm
A new libm-2.23.so file has been generated there, so repoint the symlink to it:
unlink libm.so.6
ln -s libm-2.23.so libm.so.6
Then go back to the build directory and run make install again