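Setting up a deep learning environment on a Linux server

System: Red Hat Enterprise Linux 7 (the same steps apply to CentOS 7)

NVIDIA driver: version 410.129 (supports CUDA Toolkit 10.0)

CUDA 10.0

cuDNN 7.6.5

Python 3.7

tensorflow-gpu 1.14.0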

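Step 1: Install Anaconda
bash Anaconda3-2020.02-Linux-x86_64.sh
vim /etc/profile
Add the following lines (the path assumes the installer was run as root with the default prefix /root/anaconda3):
#Anaconda
export PATH=/root/anaconda3/bin:$PATH

source /etc/profile

Check the Python version:
python --version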
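
If python --version still reports the system Python, the PATH entry is usually the cause. The following is a minimal sketch for checking it, assuming the /root/anaconda3 prefix used above. The helper name path_contains is made up for this example.

```shell
# Return 0 if directory $1 is one of the entries of the PATH-style string $2.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prefix used in this walkthrough; adjust it if Anaconda was installed elsewhere.
ANACONDA_BIN=/root/anaconda3/bin

if path_contains "$ANACONDA_BIN" "$PATH"; then
    echo "Anaconda is on PATH; python resolves to: $(command -v python)"
else
    echo "Anaconda is NOT on PATH; re-check /etc/profile and source it again"
fi
```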


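Step 2: Install the NVIDIA driver
1. Disable the open-source graphics driver that ships with the system (nouveau)
vim /etc/modprobe.d/blacklist.conf

Append the following lines to the end of the file: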

blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist rivatv
blacklist nvidiafb

chmod 644 /etc/modprobe.d/blacklist.conf

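Then reboot the server:
reboot

After the reboot, run the following command. Empty output means nouveau is no longer loaded:
lsmod | grep nouveau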
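
The same check can be scripted so that it prints an explicit result instead of relying on empty output. This is only a sketch; module_listed is a hypothetical helper that matches the first column of lsmod output exactly.

```shell
# Succeed if module $1 appears in the first column of lsmod-style text on stdin.
module_listed() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

if ! command -v lsmod >/dev/null 2>&1; then
    echo "lsmod not found on this system"
elif lsmod | module_listed nouveau; then
    echo "nouveau is still loaded; re-check the blacklist file and reboot"
else
    echo "nouveau is not loaded"
fi
```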


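2. Install dependencies
Installing the driver requires a few build dependencies such as gcc, g++ and make. On RHEL/CentOS the g++ compiler is provided by the gcc-c++ package:

yum -y install gcc gcc-c++ kernel-devel kernel-headers make

Go to https://www.nvidia.cn/Download/index.aspx?lang=cn and find the driver that matches your GPU and operating system. The one used here:
Tesla Driver for Linux RHEL 7

Version:           410.129
Release date:      2019.9.4
Operating system:  Linux 64-bit RHEL7
CUDA Toolkit:      10.0
Language:          Chinese (Simplified)
File size:         111.5 MB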

3. Install the driver
rpm -ivh nvidia-diag-driver-local-repo-rhel7-410.129-1.0-1.x86_64.rpm

yum install -y cuda-drivers

Run the following command to check which driver packages are installed:
rpm -qa | grep -i nvidia


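Step 3: Install CUDA 10.0

1. If you downloaded the runfile (.run) installer:
sudo sh cuda_10.0.130_410.48_linux.run

One thing to watch for: when the installer asks whether to install the bundled GPU driver, answer no, because the 410.129 driver is already installed:

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
(y)es/(n)o/(q)uit: n

To uninstall a runfile installation (for example before reinstalling):
cd /usr/local/cuda/bin
sudo ./uninstall_cuda_10.0.pl

After the installation finishes, reboot:
reboot

Then check that the driver can see the GPU:
nvidia-smi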

2. Configure environment variables

For the current user, edit ~/.bashrc:
vim ~/.bashrc

Add:
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"

source ~/.bashrc

For all users, edit /etc/profile:
vim /etc/profile

Add:
export PATH=/usr/local/cuda/bin:$PATH
export CUDA_INSTALL_PATH=/usr/local/cuda
export LD_LIBRARY_PATH=$CUDA_INSTALL_PATH/lib64:$LD_LIBRARY_PATH

source /etc/profile
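
Editing the profile by hand more than once tends to leave duplicate export lines. The sketch below appends each line only if it is not already present. append_once is a hypothetical helper, and PROFILE defaults to a scratch file so the example can be tried without touching a real profile. The LD_LIBRARY_PATH line deliberately differs from the one above: it uses ${LD_LIBRARY_PATH:+...} so that no trailing colon is left behind when the variable was previously empty.

```shell
# Append line $2 to file $1 only if that exact line is not already in the file.
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Scratch file by default; set PROFILE=/etc/profile (or ~/.bashrc) for real use.
PROFILE="${PROFILE:-$(mktemp)}"

# Single quotes keep $PATH etc. literal so they are expanded at login time.
append_once "$PROFILE" 'export PATH=/usr/local/cuda/bin:$PATH'
append_once "$PROFILE" 'export CUDA_INSTALL_PATH=/usr/local/cuda'
append_once "$PROFILE" 'export LD_LIBRARY_PATH=$CUDA_INSTALL_PATH/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}'

echo "Profile written to: $PROFILE"
```

Running the script a second time against the same file leaves it unchanged.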

3. Register the CUDA library directory with the dynamic linker:

vim /etc/ld.so.conf.d/cuda.conf

Add the following line:
/usr/local/cuda/lib64

Then run:
sudo ldconfig
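
To confirm that the linker cache picked up the new directory, one option is to look for the CUDA runtime library in the output of ldconfig -p. A rough sketch follows; lib_in_cache is a hypothetical helper, and the check assumes the runtime library name starts with libcudart.so.

```shell
# Succeed if a library whose name starts with $1 appears in `ldconfig -p`-style text on stdin.
lib_in_cache() {
    awk -v lib="$1" 'index($1, lib) == 1 { found = 1 } END { exit !found }'
}

if ! command -v ldconfig >/dev/null 2>&1; then
    echo "ldconfig is not on PATH (on RHEL it lives in /sbin)"
elif ldconfig -p | lib_in_cache libcudart.so; then
    echo "libcudart is registered in the linker cache"
else
    echo "libcudart not found; re-check /etc/ld.so.conf.d/cuda.conf and run ldconfig again"
fi
```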

 


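Step 4: Install cuDNN 7.6.5 for CUDA 10.0

Extract the archive. The commands below assume it is extracted in /data/software, which creates /data/software/cuda:
tar -zxvf cudnn-10.0-linux-x64-v7.6.5.326.tgz

Copy the header and libraries into the CUDA installation and make them readable:
cp /data/software/cuda/include/cudnn.h /usr/local/cuda/include
cp /data/software/cuda/lib64/libcudnn* /usr/local/cuda/lib64
chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

Refresh the symbolic links:

cd /usr/local/cuda/lib64
sudo ln -sf libcudnn.so.7.6.5 libcudnn.so.7
sudo ln -sf libcudnn.so.7 libcudnn.so
sudo ldconfig

Check the nvcc information to verify that the CUDA toolkit is installed (the two forms are equivalent):
nvcc -V
nvcc --version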
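
nvcc only reports the CUDA toolkit version. For cuDNN 7.x, the version is recorded in cudnn.h as the CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL macros, so it can be read back from the copied header. A minimal sketch, with cudnn_version as a hypothetical helper name:

```shell
# Print MAJOR.MINOR.PATCHLEVEL read from the cuDNN 7.x header file given as $1.
cudnn_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { if (major == "") exit 1; print major "." minor "." patch }' "$1"
}

# Location the header was copied to in the step above.
HEADER=/usr/local/cuda/include/cudnn.h

if [ -r "$HEADER" ]; then
    echo "cuDNN $(cudnn_version "$HEADER")"
else
    echo "$HEADER not found; copy the cuDNN header first"
fi
```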

 

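Uninstalling

List the installed NVIDIA packages, then remove the local repo package that the query prints (the package name below comes from a 440.64.00 installation; use the name shown on your system):
rpm -qa | grep -i nvidia
yum remove nvidia-driver-local-repo-rhel7-440.64.00-1.0-1.x86_64

Do the same for the CUDA packages:
rpm -qa | grep -i cuda
yum remove cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64
yum clean all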


Running the sample code fails with:
ImportError: /lib64/libm.so.6: version `GLIBC_2.23' not found (required by /root/anaconda3/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)
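
The error means the TensorFlow binary was built against glibc 2.23 or newer, while RHEL 7 ships glibc 2.17. A quick way to confirm the installed version is sketched below. version_ge is a hypothetical helper and relies on GNU sort -V.

```shell
# Succeed if dotted version $1 is greater than or equal to $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Version named in the ImportError above.
REQUIRED=2.23

# getconf prints something like "glibc 2.17"; keep only the number.
INSTALLED=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{ print $2 }')

if [ -z "$INSTALLED" ]; then
    echo "could not determine the glibc version"
elif version_ge "$INSTALLED" "$REQUIRED"; then
    echo "glibc $INSTALLED satisfies the GLIBC_$REQUIRED requirement"
else
    echo "glibc $INSTALLED is older than $REQUIRED"
fi
```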

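1. Download glibc

Download the source tarball from http://www.gnu.org/software/libc/

2. Build and install
tar -zxvf glibc-2.23.tar.gz
cd glibc-2.23
mkdir build
cd build
echo $LD_LIBRARY_PATH
../configure --prefix=/data/glibc-2.23
(Alternative that installs over the system glibc: ../configure --prefix=/usr --disable-profile --enable-kernel=2.6.32 --enable-obsolete-rpc)

configure fails with:
LD_LIBRARY_PATH shouldn't contain the current directory

Cause: in this case the echo above printed /usr/local/cuda-10.0/lib64: and the trailing colon is an empty entry, which configure treats as the current directory.

Fix: temporarily clear the variable, either by editing ~/.bashrc so that it sets
LD_LIBRARY_PATH=
or directly in the current shell:
export LD_LIBRARY_PATH=
echo $LD_LIBRARY_PATH

Remember to restore the original value after the build.

make -j4
make install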
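
Instead of editing ~/.bashrc and restoring it afterwards, the variable can be removed for the build commands only. This is a sketch of that alternative, not the method used above. without_ld_library_path is a hypothetical wrapper around env -u.

```shell
# Run the given command with LD_LIBRARY_PATH removed from its environment only.
without_ld_library_path() {
    env -u LD_LIBRARY_PATH "$@"
}

# Demonstration: the child process sees no LD_LIBRARY_PATH,
# while the current shell keeps whatever value it had.
without_ld_library_path sh -c 'echo "inside:  ${LD_LIBRARY_PATH-<unset>}"'
echo "outside: ${LD_LIBRARY_PATH-<unset>}"

# Intended use from glibc-2.23/build (commented out so the sketch has no side effects):
# without_ld_library_path ../configure --prefix=/data/glibc-2.23
# without_ld_library_path make -j4
```

Because the wrapper changes the environment of the child process only, nothing needs to be restored after the build.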


make install then fails with another error:
gawk: error while loading shared libraries: /lib64/libm.so.6: invalid ELF header
make[2]: *** [/data/software/glibc-2.23/build/math/stubs] Error 127
make[2]: Leaving directory `/data/software/glibc-2.23/math'
make[1]: *** [math/subdir_install] Error 2
make[1]: Leaving directory `/data/software/glibc-2.23'
make: *** [install] Error 2

 

Fix

cd /lib64
ls -l | grep libm
A new libm-2.23.so file has been created there, so repoint the symlink at it:

unlink libm.so.6

ln -s libm-2.23.so libm.so.6

Then run make install again.

 

 

 
