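Work now requires multi-GPU training, so moving to tf2+ is unavoidable; there is really no way around it. Remembering how painful the tf1+ install was, I'm sticking with conda this time. The default channels are painfully slow, though, so I switched to the Tsinghua (TUNA) mirror.
I couldn't make sense of the official installation instructions, so I searched for a method that works: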
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes
conda config --show
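As a cross-check on the commands above, here is a minimal sketch of what ~/.condarc should roughly contain afterwards. It assumes the channels key was unset beforehand, that `--add` prepends (so the channel added last comes first), and that conda keeps `defaults` at the end of the list. The helper name expected_condarc is hypothetical.

```shell
# Hypothetical helper: print the .condarc that the three config commands
# above are expected to produce (assumption: no channels were set before).
expected_condarc() {
  cat <<'EOF'
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - defaults
show_channel_urls: true
EOF
}
```

Running `expected_condarc | diff - ~/.condarc` shows any difference from the real file without modifying it.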
To restore the default channels:
conda config --remove-key channels
For reference, my previous working environment was:
$ nvidia-smi
Thu May 13 17:31:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A |
| 23% 24C P8 9W / 250W | 259MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:82:00.0 Off | N/A |
| 23% 25C P8 8W / 250W | 259MiB / 11178MiB | 0% Default |
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
$ pip list
tensorboard 2.4.0
tensorboard-plugin-wit 1.7.0
tensorboardX 2.1
tensorflow-addons 0.12.0
tensorflow-datasets 4.1.0
tensorflow-estimator 2.3.0
tensorflow-gpu 2.3.0
tf-estimator-nightly 2.5.0.dev2021010501
tf-models-official 2.3.0
tf-slim 1.1.0
The new machine has a 3090. CUDA is reported as 11.2 and the driver is 460.73, but no CUDA toolkit is installed...
$ nvcc -V
Command 'nvcc' not found, but can be installed with:
apt install nvidia-cuda-toolkit
Please ask your administrator.
I wanted to install the toolkit with conda, but conda hung after installing tf and I couldn't get any further.
conda install cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/linux-64/
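cudnn can be installed the same way.
So the plan: install tf2, cudatoolkit and cudnn in one go. I deleted the conda folder and started over.
To check the Ubuntu version (cat /proc/version prints the kernel build string, which on Ubuntu includes the Ubuntu build; lsb_release -a prints the release itself):
cat /proc/version
lsb_release -a
To install CUDA, I followed some guides with detailed installation and verification steps: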
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda_11.1.0_455.23.05_linux.run
sudo sh cuda_11.1.0_455.23.05_linux.run
Never mind, I'll ask ops tomorrow... Not having root is a real pain.
[Update, May 14]
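I got sudo and uninstalled CUDA and the driver. CUDA needs the driver installed first; the driver comes from NVIDIA's driver download page as a .run file. Going by NVIDIA's CUDA/driver compatibility table,
I picked 455.38, the lowest driver version that supports the 3090, to see whether CUDA 10.1 could be installed on top of it.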
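The driver side of that compatibility question can be sketched as a one-directional check: the installed driver has to be at least the minimum a toolkit asks for, and newer drivers remain compatible with older toolkits. The minimums below are assumptions for illustration: 418.00 is the figure the CUDA 10.1 installer prints in its warning, and 455.23 is taken from the name of the CUDA 11.1.0 runfile. The function names are hypothetical, and the check says nothing about whether a toolkit supports the GPU architecture itself.

```shell
# True if version $1 >= version $2 (relies on GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Assumed minimum driver per CUDA toolkit (illustrative values, see lead-in).
min_driver_for_cuda() {
  case "$1" in
    10.1) echo 418.00 ;;   # from the CUDA 10.1 installer warning
    11.1) echo 455.23 ;;   # assumption: from the 11.1.0 runfile name
    *)    return 1 ;;
  esac
}

# driver_ok_for_cuda DRIVER CUDA
# returns 0 if the driver is new enough, 1 if too old, 2 if CUDA is not in the table.
driver_ok_for_cuda() {
  local min
  min=$(min_driver_for_cuda "$2") || return 2
  version_ge "$1" "$min"
}
```

For example, `driver_ok_for_cuda 455.38 10.1 && echo ok` succeeds, which suggests the 455.38 driver by itself does not rule out the CUDA 10.1 toolkit.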
Before installing the driver, everything related to the old CUDA, driver and toolkit has to be removed, otherwise the installation fails.
sudo apt remove "*cublas*" "cuda*"
sudo apt remove "*nvidia*"
Even after uninstalling, the driver installation failed:
sudo sh NVIDIA-Linux-x86_64-455.38.run
I uninstalled and installed again and hit the same error. I was at a loss.
sudo apt-get --purge remove "*nvidia*"
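I then found another removal method (the pattern is quoted so the shell passes it to apt instead of expanding it against local files):
sudo apt-get purge 'nvidia*'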
sudo apt-get autoremove
sudo reboot
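Before rerunning the driver installer, it may be worth confirming that the purge really left nothing behind. The sketch below filters `dpkg -l` output for packages that are still installed (state ii) or removed with config files remaining (state rc) and whose names mention nvidia, cuda or cublas. The function name is hypothetical and the name pattern is an assumption about which packages matter.

```shell
# Read `dpkg -l` output on stdin and print state + name for any
# NVIDIA/CUDA-related package that is installed (ii) or has leftover
# config files (rc). An empty result means nothing was found.
leftover_gpu_packages() {
  awk '($1 == "ii" || $1 == "rc") && $2 ~ /nvidia|cuda|cublas/ { print $1, $2 }'
}
```

Usage: `dpkg -l | leftover_gpu_packages`. If anything is listed, purge it before running the .run installer again.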
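That fixed it after the reboot: the driver installed cleanly. nvidia-smi now shows CUDA 11.1, which made me think CUDA had been installed along with the driver, but that field only reports the highest CUDA version the driver supports, not an installed toolkit.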
# nvidia-smi
Fri May 14 11:39:07 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |
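To keep the driver version and the driver-supported CUDA version apart from whatever toolkit nvcc reports, the nvidia-smi header line can be parsed directly. This is a minimal sketch; the function name smi_versions and the output labels are hypothetical, and it assumes the header keeps the "Driver Version: ... CUDA Version: ..." layout shown above.

```shell
# Read nvidia-smi output on stdin and print the driver version and the
# highest CUDA version that driver supports (not the installed toolkit).
smi_versions() {
  sed -n 's/.*Driver Version: *\([0-9.]*\).*CUDA Version: *\([0-9.]*\).*/driver=\1 cuda_max=\2/p'
}
```

Usage: `nvidia-smi | smi_versions`. Compare cuda_max with the release printed by `nvcc -V` to see whether a toolkit is installed at all and whether it is within what the driver supports.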
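CUDA/driver compatibility is listed in NVIDIA's compatibility table.
I installed the CUDA toolkit myself. The download command given on NVIDIA's site looks like this: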
wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda_11.3.0_465.19.01_linux.run
# Changed the versions to 10.1.243 and 418.87.00 to match the earlier environment above
# sudo sh cuda_10.1.243_418.87.00_linux.run
===========
= Summary =
===========
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-10.1/
Samples: Installed in /root/, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-10.1/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-10.1/lib64, or, add /usr/local/cuda-10.1/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.1/bin
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.1/doc/pdf for detailed information on setting up CUDA.
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 418.00 is required for CUDA 10.1 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
Logfile is /var/log/cuda-installer.log
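Following the prompt above, add the paths to ~/.bashrc and source it; nvcc is then available. Note that the library directory is lib64, as the installer says, and that any existing LD_LIBRARY_PATH should be kept rather than overwritten.
# add nvcc by 最帅的小明哥
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export PATH=$PATH:/usr/local/cuda-10.1/bin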
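After sourcing ~/.bashrc, a quick way to confirm that both entries actually landed in the environment is to test the colon-separated lists for an exact match. This is a small sketch; path_contains is a hypothetical helper name, and the two directories are the ones named in the installer summary.

```shell
# path_contains LIST DIR
# True if DIR is one of the colon-separated entries in LIST.
path_contains() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *)        return 1 ;;
  esac
}
```

Usage: `path_contains "$PATH" /usr/local/cuda-10.1/bin && echo "bin ok"` and `path_contains "$LD_LIBRARY_PATH" /usr/local/cuda-10.1/lib64 && echo "lib64 ok"`. Matching whole entries avoids a false positive from a similar prefix such as /usr/local/cuda-10.1/bin-old.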
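There is still a problem, though. Running my script logs the following (as far as I can tell TensorFlow emits this line at info level, so on its own it is probably not the actual failure):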
Not creating XLA devices, tf_xla_enable_xla_devices not set
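Time to install CUDA 11.1; CUDA 10.1 doesn't look workable. Strictly speaking the 455 driver does not rule out older toolkits, since drivers are backward compatible; the constraint is more likely the 3090 itself, an Ampere card (compute capability 8.6) that CUDA supports natively only from 11.1.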
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda_11.1.0_455.23.05_linux.run
sudo sh cuda_11.1.0_455.23.05_linux.run
However, whenever I install tensorflow-gpu with conda, it automatically pulls in cudatoolkit 10.1..... regardless of the tf version and regardless of my environment. Infuriating.
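For orientation, the CUDA/cuDNN pair each TensorFlow 2.x release was built against can be written down as a small lookup, which makes it easier to spot when conda pulls in a toolkit that doesn't match. The values are my recollection of TensorFlow's tested build configurations and should be treated as assumptions to verify against the official table; two of them agree with this post (tensorflow-gpu 2.3.0 on CUDA 10.1 in the old environment, and cuda-11-0 with libcudnn8 8.0.4 in the official apt instructions). The function name tf_cuda_pair is hypothetical.

```shell
# tf_cuda_pair TF_VERSION
# Print the CUDA and cuDNN versions the given TensorFlow release is
# assumed to be built against; return 1 if the version is not listed.
tf_cuda_pair() {
  case "$1" in
    2.5.*)             echo "cuda=11.2 cudnn=8.1" ;;
    2.4.*)             echo "cuda=11.0 cudnn=8.0" ;;
    2.1.*|2.2.*|2.3.*) echo "cuda=10.1 cudnn=7.6" ;;
    2.0.*)             echo "cuda=10.0 cudnn=7.4" ;;
    *)                 echo "unknown"; return 1 ;;
  esac
}
```

Usage: `tf_cuda_pair 2.4.1` prints the pair to compare against the cudatoolkit version shown in `conda list`.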
Let me try the official tf-gpu instructions once more. In case they change, here they are:
# Add NVIDIA package repositories
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt-get update
# Install NVIDIA driver
sudo apt-get install --no-install-recommends nvidia-driver-450
# Reboot. Check that GPUs are visible using the command: nvidia-smi
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
sudo apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
sudo apt-get update
# Install development and runtime libraries (~4GB)
sudo apt-get install --no-install-recommends \
cuda-11-0 \
libcudnn8=8.0.4.30-1+cuda11.0 \
libcudnn8-dev=8.0.4.30-1+cuda11.0
# Install TensorRT. Requires that libcudnn8 is installed above.
sudo apt-get install -y --no-install-recommends libnvinfer7=7.1.3-1+cuda11.0 \
libnvinfer-dev=7.1.3-1+cuda11.0 \
libnvinfer-plugin7=7.1.3-1+cuda11.0
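See the official tested-configurations table for version compatibility (GPU, tf-gpu, bazel, gcc, CUDA).
This still fails with an error; see my issue.
Exhausting.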