Installing tensorflow-gpu on CUDA 9.1

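Preface

 Before getting into the tensorflow-gpu installation, a quick rant: converting the PyTorch DeepLabv3+ model to ONNX was a real pain. First I was told that PyTorch 1.1.0 had problems; then, after finally switching to PyTorch 1.0.1 or 1.2.0, I ran into operators in the model that ONNX does not support. After several days of going back and forth I gave up and decided to try training DeepLabv3+ with TensorFlow instead. PS: if anyone has experience converting PyTorch DeepLabv3+ to ONNX, please let me know :)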

Installation

My platform is Ubuntu 18.04 + CUDA 9.1.

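Running pip3 install /work/xxx/tensorflow_gpu-1.8.0-cp36-cp36m-manylinux1_x86_64.whl installs the GPU build of TensorFlow without any trouble, but import tensorflow then fails because libcublas.so.9.0 / libcudart.so.9.0 cannot be found. The reason is that the tensorflow_gpu 1.8.0 wheel is built against CUDA 9.0 and looks for the CUDA 9.0 libraries at run time.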
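Before importing TensorFlow, it can help to ask the dynamic loader directly whether the libraries it wants are present. The following is a minimal sketch using only the standard ctypes module; the helper name find_missing_libs is my own, not part of TensorFlow or CUDA.

```python
import ctypes


def find_missing_libs(names):
    """Return the shared-library names that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            # CDLL asks the loader to open the library, searching the same
            # paths (LD_LIBRARY_PATH, ldconfig cache) that TensorFlow relies on.
            ctypes.CDLL(name)
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # The two libraries named in the import error above.
    wanted = ["libcublas.so.9.0", "libcudart.so.9.0"]
    print("missing:", find_missing_libs(wanted))
```

On a CUDA 9.1 machine both names would be expected to show up as missing, since CUDA 9.1 ships the .so.9.1 versions instead.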

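In my experiments, tensorflow_gpu 1.6 through 1.8 all run only on CUDA 9.0. Many posts online suggest uninstalling CUDA 9.x and reinstalling CUDA 9.0, but that is quite a hassle and also involves things like downgrading gcc.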

I found that tensorflow_gpu-1.14.0-cp36-cp36m-manylinux1_x86_64.whl can be installed and imported on a CUDA 9.1 machine. The wheel has to be downloaded first from https://files.pythonhosted.org/packages/76/04/43153bfdfcf6c9a4c38ecdb971ca9a75b9a791bb69a764d652c359aca504/tensorflow_gpu-1.14.0-cp36-cp36m-manylinux1_x86_64.whl
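The wheel filename follows the standard wheel naming scheme (distribution, version, Python tag, ABI tag, platform tag), so it is easy to check which interpreter a wheel targets. A small sketch, with parse_wheel_filename being a hypothetical helper of my own:

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its standard dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(".whl")].split("-")
    # Five parts normally, six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: " + filename)
    # The last three parts are always the python, ABI and platform tags.
    return {
        "distribution": parts[0],
        "version": parts[1],
        "python_tag": parts[-3],
        "abi_tag": parts[-2],
        "platform_tag": parts[-1],
    }


if __name__ == "__main__":
    info = parse_wheel_filename(
        "tensorflow_gpu-1.14.0-cp36-cp36m-manylinux1_x86_64.whl")
    print(info)
```

Note that nothing in the filename records the CUDA version the wheel was built against, which is why pip installs it happily on any CUDA setup and the mismatch only surfaces at run time.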

After installation, import tensorflow succeeds, but it prints warnings like the following.

  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/bc311/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/bc311/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

Checking the numpy version on my platform shows 1.19.1.

bcxxx@bcxxx-ai1:~$ pip3 show numpy
Name: numpy
Version: 1.19.1
Summary: NumPy is the fundamental package for array computing with Python.

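The numpy version is probably too new (this FutureWarning was introduced in NumPy 1.17), so I reinstalled numpy at version 1.14.0. After that, tensorflow loads cleanly, without the warnings.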

 pip3 install numpy==1.14.0
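To decide whether a downgrade is needed at all, the installed version can be compared against 1.17, the release that introduced this deprecation. A minimal sketch; the function name at_least_numpy_1_17 is my own and the parsing assumes a plain "major.minor.patch" version string:

```python
def at_least_numpy_1_17(numpy_version):
    """Return True if the given NumPy version string is 1.17 or newer."""
    major, minor = (int(part) for part in numpy_version.split(".")[:2])
    return (major, minor) >= (1, 17)


if __name__ == "__main__":
    try:
        import numpy
        print(numpy.__version__, at_least_numpy_1_17(numpy.__version__))
    except ImportError:
        print("numpy is not installed")
```

Since these are only warnings and the import itself succeeds, an alternative to downgrading is to silence them with warnings.filterwarnings("ignore", category=FutureWarning) from the standard warnings module before importing tensorflow.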

Follow-up: although import tensorflow now works, the tensorflow-gpu 1.14 wheel still tries to load the CUDA 10.0 libraries at run time. To check whether TensorFlow can actually use the GPU device, run the following:

tensorflow.test.gpu_device_name()

On my platform the GPU device fails to load, which means training can only run on the CPU.

tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7

W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
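The log above already states which CUDA version the wheel expects: every failed dlopen names a library with a .so.10.0 suffix. As a rough sketch, the names and versions can be pulled out of such a log with a regular expression. The pattern is written against the message format shown above and may not match other TensorFlow releases; missing_cuda_libraries is a hypothetical helper.

```python
import re

# Matches e.g.: Could not dlopen library 'libcudart.so.10.0'
_DLOPEN_FAILURE = re.compile(
    r"Could not dlopen library '(lib\w+)\.so\.([\d.]+)'")


def missing_cuda_libraries(log_text):
    """Return (library, version) pairs for every failed dlopen in the log."""
    return [(m.group(1), m.group(2))
            for m in _DLOPEN_FAILURE.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; "
        "dlerror: libcudart.so.10.0: cannot open shared object file\n"
        "dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7\n"
    )
    for name, version in missing_cuda_libraries(sample):
        print(name, "needs CUDA", version)
```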

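Second follow-up: a TensorFlow 1.7.0 wheel does seem to work with CUDA 9.1. Note the linux_x86_64 platform tag in its filename: this is presumably a custom-built wheel rather than the official manylinux1 one, which would explain why it does not need CUDA 9.0. It is installed with: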

pip3 install ./tensorflow-1.7.0-cp36-cp36m-linux_x86_64.whl

The CUDA test commands and their output:

>>> import tensorflow
>>> tensorflow.test.gpu_device_name()
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:65:00.0
totalMemory: 10.92GiB freeMemory: 10.65GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/device:GPU:0 with 10310 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
'/device:GPU:0'
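The same check can be wrapped in a small script so that it prints a clear verdict instead of a bare string. tensorflow.test.gpu_device_name() returns an empty string when no GPU is registered; the describe_gpu helper below is my own naming, and the script assumes nothing beyond that one call.

```python
def describe_gpu(device_name):
    """Turn the result of tensorflow.test.gpu_device_name() into a message."""
    if device_name:
        return "GPU available: " + device_name
    # An empty string means TensorFlow registered no GPU device.
    return "no GPU registered, TensorFlow will fall back to the CPU"


if __name__ == "__main__":
    try:
        import tensorflow
        print(describe_gpu(tensorflow.test.gpu_device_name()))
    except ImportError:
        print("tensorflow is not installed")
```

With the 1.7.0 wheel above this would report /device:GPU:0, while with the 1.14.0 wheel on CUDA 9.1 it would report the CPU fallback.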
