Installing TensorFlow 2+ with conda via the Tsinghua (TUNA) mirror

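Work now requires multi-GPU training, so moving to TF 2+ is unavoidable. Remembering how painful the TF 1.x install was, I am sticking with conda this time, but the default channels are painfully slow from here, so I switched to the Tsinghua mirror.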


I could not make sense of the official installation page, so I searched for a method that works:

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes
conda config --show

To restore the default channels:

conda config --remove-key channels

For reference, the environment I previously got working was:

$ nvidia-smi
Thu May 13 17:31:59 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   24C    P8     9W / 250W |    259MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 23%   25C    P8     8W / 250W |    259MiB / 11178MiB |      0%      Default |

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
$ pip list
tensorboard                   2.4.0
tensorboard-plugin-wit        1.7.0
tensorboardX                  2.1
tensorflow-addons             0.12.0
tensorflow-datasets           4.1.0
tensorflow-estimator          2.3.0
tensorflow-gpu                2.3.0
tf-estimator-nightly          2.5.0.dev2021010501
tf-models-official            2.3.0
tf-slim                       1.1.0

The new machine has a 3090. nvidia-smi reports CUDA 11.2 with driver 460.73, but no CUDA toolkit is installed:

$ nvcc -V

Command 'nvcc' not found, but can be installed with:

apt install nvidia-cuda-toolkit
Please ask your administrator.
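
Since the missing nvcc is what started all this, a quick way to see which CUDA tools are actually on PATH is a small Python check. This is only a sketch: the helper names `find_cuda_tools` and `nvcc_release` are my own, and it assumes nothing beyond the standard library.

```python
import shutil
import subprocess


def find_cuda_tools():
    """Return the resolved path of nvcc and nvidia-smi, or None if absent from PATH."""
    return {name: shutil.which(name) for name in ("nvcc", "nvidia-smi")}


def nvcc_release(nvcc_path):
    """Run `nvcc -V` and return its text output, or None if it cannot be run."""
    try:
        result = subprocess.run(
            [nvcc_path, "-V"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout


if __name__ == "__main__":
    for name, path in find_cuda_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If nvidia-smi is found but nvcc is not, the driver is present and only the toolkit is missing, which matches the situation above.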

I had planned to install the toolkit with conda, but conda hung after installing TF and I could not get any further. The command was:

conda install cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/linux-64/

cuDNN can be installed the same way.

So the new plan: install tf2, cudatoolkit and cudnn in one go. I deleted the conda folder and started over.

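To check the Ubuntu version (/proc/version shows the kernel build string; /etc/os-release gives the release itself):

cat /proc/version
cat /etc/os-release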

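For installing CUDA there is a basic method, a more detailed guide covering installation and verification, and one more reference. The runfile route looks like this: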

wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda_11.1.0_455.23.05_linux.run
sudo sh cuda_11.1.0_455.23.05_linux.run
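
The runfile name itself carries two version numbers: the toolkit version and the version of the driver bundled in the installer. A minimal sketch for pulling them apart, assuming the `cuda_<toolkit>_<driver>_linux.run` naming seen in this post (the function name `parse_runfile` is hypothetical):

```python
import re

# Matches names such as cuda_11.1.0_455.23.05_linux.run (bare or at the end of a URL).
RUNFILE_RE = re.compile(r"cuda_(\d+\.\d+\.\d+)_(\d+(?:\.\d+)+)_linux\.run$")


def parse_runfile(name_or_url):
    """Return (toolkit_version, bundled_driver_version) from a runfile name or URL."""
    match = RUNFILE_RE.search(name_or_url)
    if match is None:
        raise ValueError(f"not a recognised CUDA runfile name: {name_or_url!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(parse_runfile("cuda_11.1.0_455.23.05_linux.run"))
```

This makes it easy to compare the bundled driver against whatever nvidia-smi reports before deciding whether to let the installer touch the driver at all.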

Never mind, I will ask ops tomorrow. Not having root access is really frustrating.

May we meet again someday, and may you still remember what we once discussed.

[Update, May 14]

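I got sudo access and uninstalled CUDA and the driver. CUDA needs a driver installed first; the driver comes from NVIDIA's driver download page (you need the .run file). The CUDA/driver compatibility page shows the following: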

[Image 1 of the original post]

[Image 2 of the original post]

For the driver I picked 455.38, the lowest version that supports the 3090, to see whether CUDA 10.1 could still be installed.

[Image 3 of the original post]

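Before installing the driver you must first remove everything related to the previous CUDA, driver and toolkit, otherwise the installation fails.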

sudo apt remove "*cublas*" "cuda*"
sudo apt remove "*nvidia*"

Even after removing them, the installation still failed:

sudo sh NVIDIA-Linux-x86_64-455.38.run

[Image 4 of the original post]

I uninstalled and installed again and hit the same error. I was at a loss for what to try next.

sudo apt-get --purge remove "*nvidia*"

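I then found another removal method (the pattern is quoted so the shell does not expand it against local file names):

sudo apt-get purge "nvidia*"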
sudo apt-get autoremove
sudo reboot

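After the reboot the problem was gone and the driver installed. At first it looked as if CUDA had been installed along with it, but the "CUDA Version" field in nvidia-smi only shows the highest CUDA version the driver supports, not an installed toolkit: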

# nvidia-smi
Fri May 14 11:39:07 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
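
To compare machines without squinting at the banner, the two numbers in the nvidia-smi header can be extracted with a regular expression. A rough sketch, assuming the header layout shown in this post (the name `parse_smi_header` is my own):

```python
import re

HEADER_RE = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")


def parse_smi_header(text):
    """Return (driver_version, cuda_version) from nvidia-smi output, or None if absent."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    header = "| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |"
    print(parse_smi_header(header))
```

The second value is what the driver reports as its CUDA version; whether a toolkit is installed still has to be checked separately, for example with nvcc -V.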

CUDA/driver compatibility:

[Image 5 of the original post]
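
The compatibility rule boils down to "the installed driver must be at least the minimum required by the toolkit". Here is a small sketch of that check. The minimum-driver values in the table are assumptions written down from memory of NVIDIA's CUDA release notes for Linux, so please verify them against the current notes before relying on them; the helper names are hypothetical.

```python
# Assumed minimum Linux driver per CUDA toolkit release (verify against NVIDIA's release notes).
MIN_DRIVER = {
    "10.1": "418.39",
    "11.0": "450.36.06",
    "11.1": "455.23",
    "11.2": "460.27.03",
    "11.3": "465.19.01",
}


def version_tuple(version):
    """Turn '455.23.05' into (455, 23, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(driver_version, cuda_version):
    """True if the driver meets the assumed minimum for the given CUDA release."""
    return version_tuple(driver_version) >= version_tuple(MIN_DRIVER[cuda_version])


if __name__ == "__main__":
    for cuda in MIN_DRIVER:
        print(cuda, driver_supports("455.38", cuda))
```

Under these assumed values, a 455.38 driver is new enough for toolkits 10.1 through 11.1 but not for 11.2 or 11.3. Whether a given toolkit supports a given GPU is a separate question that this check does not cover.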

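The CUDA toolkit has to be installed separately. The download command looks like this:

wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda_11.3.0_465.19.01_linux.run
# I changed the versions to 10.1.243 and 418.87.00 to match the earlier setup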

# sudo sh cuda_10.1.243_418.87.00_linux.run
===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-10.1/
Samples:  Installed in /root/, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-10.1/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-10.1/lib64, or, add /usr/local/cuda-10.1/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.1/bin

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.1/doc/pdf for detailed information on setting up CUDA.
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 418.00 is required for CUDA 10.1 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run --silent --driver

Logfile is /var/log/cuda-installer.log

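Following the hint above, add the paths to ~/.bashrc and source it; nvcc is then available. Note that the library directory is lib64, as the summary says, and that any existing LD_LIBRARY_PATH should be kept rather than overwritten:

#add nvcc by 最帅的小明哥
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export PATH=$PATH:/usr/local/cuda-10.1/bin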
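
The installer summary asks for /usr/local/cuda-10.1/bin on PATH and /usr/local/cuda-10.1/lib64 on LD_LIBRARY_PATH. After sourcing ~/.bashrc, a check along these lines can confirm both made it into the environment. It is a sketch only; `missing_cuda_paths` is a hypothetical helper and the default install prefix is taken from the summary above.

```python
import os


def missing_cuda_paths(cuda_home="/usr/local/cuda-10.1", env=None):
    """Return {variable: directory} for each required CUDA directory not yet listed."""
    env = os.environ if env is None else env
    wanted = {
        "PATH": os.path.join(cuda_home, "bin"),
        "LD_LIBRARY_PATH": os.path.join(cuda_home, "lib64"),
    }
    missing = {}
    for var, directory in wanted.items():
        # Split on the platform separator and ignore empty entries and trailing slashes.
        entries = [e.rstrip("/") for e in env.get(var, "").split(os.pathsep) if e]
        if directory not in entries:
            missing[var] = directory
    return missing


if __name__ == "__main__":
    print(missing_cuda_paths() or "CUDA paths look fine")
```

An empty result means both directories are present; a leftover entry such as .../lib instead of .../lib64 shows up as a missing LD_LIBRARY_PATH directory.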

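My script still does not run properly, and its output includes the line below. As far as I know TensorFlow logs this particular line at info level, so it is probably not the real failure: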

Not creating XLA devices, tf_xla_enable_xla_devices not set

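So on to CUDA 11.1; CUDA 10.1 does not look workable here. I first blamed the 455 driver for forcing a newer CUDA, but drivers are backward compatible with older toolkits. The more likely constraint is the 3090 itself, since Ampere GPUs need CUDA 11 or later.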

wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda_11.1.0_455.23.05_linux.run
sudo sh cuda_11.1.0_455.23.05_linux.run

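However, whenever I install tensorflow-gpu with conda, it pulls in cudatoolkit 10.1 automatically, regardless of the TF version or of what is already on the machine. Very frustrating.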

I will try once more with the official tf-gpu instructions. In case the page changes, here they are:

# Add NVIDIA package repositories
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update

wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt-get update

# Install NVIDIA driver
sudo apt-get install --no-install-recommends nvidia-driver-450
# Reboot. Check that GPUs are visible using the command: nvidia-smi

wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
sudo apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
sudo apt-get update

# Install development and runtime libraries (~4GB)
sudo apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8=8.0.4.30-1+cuda11.0  \
    libcudnn8-dev=8.0.4.30-1+cuda11.0


# Install TensorRT. Requires that libcudnn8 is installed above.
sudo apt-get install -y --no-install-recommends libnvinfer7=7.1.3-1+cuda11.0 \
    libnvinfer-dev=7.1.3-1+cuda11.0 \
    libnvinfer-plugin7=7.1.3-1+cuda11.0
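
Once the packages are in place, the quickest sanity check is to ask TensorFlow itself which CUDA/cuDNN it was built against and whether it sees any GPU. A minimal sketch, assuming TensorFlow 2.3 or later (where `tf.sysconfig.get_build_info()` is available); the function name `tensorflow_gpu_report` is my own:

```python
def tensorflow_gpu_report():
    """Summarise TensorFlow's CUDA build info and visible GPUs; None if TF is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    build = tf.sysconfig.get_build_info()  # dict-like; CUDA keys are absent on CPU-only builds
    return {
        "tf_version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "cuda_version": build.get("cuda_version"),
        "cudnn_version": build.get("cudnn_version"),
        "gpus": [device.name for device in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    print(tensorflow_gpu_report())
```

An empty "gpus" list together with a CUDA build usually points to a library TensorFlow could not load, and the import-time log lines normally name the missing file.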

The official compatibility table (GPU, tf-gpu, bazel, gcc, CUDA):

[Image 6 of the original post]

It still errors out; see my issue.

Exhausting.

 
