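As of ROCm 1.9.1, the AMD GPU driver no longer needs to be installed separately. Please refer to the new installation procedure:
With ROCm, the platform developed by AMD, TensorFlow can use an AMD GPU for acceleration. The setup process is described below.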
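Hardware:
CPU: AMD Ryzen 1700X
GPU: AMD Radeon RX 580
Memory: 32 GB
Storage: 256 GB SSD + 2 TB HDD
Install Ubuntu 18.04
There are plenty of Ubuntu installation tutorials online, so I won't repeat them here. I chose the minimal installation.
Install the AMD GPU driver
Download the latest driver; I used version 18.20.
The following assumes the archive was downloaded to the Downloads directory: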
cd ~/Downloads
tar -Jxvf amdgpu-pro-18.20-606296.tar.xz
cd ~/Downloads/amdgpu-pro-18.20-606296
./amdgpu-pro-install --opencl=legacy
Install ROCm
Add the ROCm repository:
wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
sudo sh -c 'echo deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list'
Then run:
sudo apt update
sudo apt install rocm-dkms
The installation will fail because amdgpu, the AMD GPU driver installed earlier, is using the same DKMS module. Force-install the package:
sudo dpkg -i --force-overwrite /var/cache/apt/archives/rock-dkms_1.8-192_all.deb
sudo apt install -f
Reboot:
sudo reboot
Installation is now complete.
You can run rocminfo to check whether the installation succeeded:
/opt/rocm/bin/rocminfo
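For a quick automated check, the rocminfo output can be scanned for GPU ISA names. The sketch below is only an illustration: it assumes the output contains tokens such as gfx803 (the ISA name that appears in the TensorFlow log later in this post), and the helper name find_gfx_targets is made up for this example.

```python
import re
import shutil
import subprocess

ROCMINFO = "/opt/rocm/bin/rocminfo"  # path used in this post


def find_gfx_targets(rocminfo_output):
    """Return the distinct gfxNNN ISA names found in the text, in order."""
    seen = []
    # Assumption: ISA names look like "gfx" followed by hex digits.
    for name in re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output):
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    if shutil.which(ROCMINFO) is None:
        print("rocminfo not found at", ROCMINFO)
    else:
        result = subprocess.run([ROCMINFO], stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
        print("exit code:", result.returncode)
        print("GPU ISA targets:", find_gfx_targets(result.stdout) or "none found")
```

If no gfx name is found, or the exit code is non-zero, the driver is probably not loaded correctly.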
Install TensorFlow (ROCm port)
Download the ROCm-specific TensorFlow wheel.
Then install the related packages:
sudo apt-get update && \
sudo apt-get install -y --allow-unauthenticated \
rocm-dkms rocm-dev rocm-libs \
rocm-device-libs \
hsa-ext-rocr-dev hsakmt-roct-dev hsa-rocr-dev \
rocm-opencl rocm-opencl-dev \
rocm-utils \
rocm-profiler cxlactivitylogger \
miopen-hip miopengemm
Then install the Python packages:
sudo apt-get update && sudo apt-get install -y \
python3-numpy \
python3-dev \
python3-wheel \
python3-mock \
python3-future \
python3-pip \
python3-yaml \
python3-setuptools
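To confirm that these packages are visible to python3, something like the following can be run. It is a minimal sketch: the list of import names is my assumption about what each apt package provides (for example, python3-yaml provides the module yaml), and python3-dev is left out because it has no importable module.

```python
import importlib.util

# Import names assumed to correspond to the apt packages above.
EXPECTED_MODULES = ["numpy", "wheel", "mock", "future", "pip", "yaml", "setuptools"]


def missing_modules(names):
    """Return the entries of `names` that Python cannot locate for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    print("missing:", missing if missing else "none")
```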
Once that is done, install the wheel (assuming it is in the Downloads directory):
sudo pip3 install ~/Downloads/tensorflow-1.8.0-cp35-cp35m-manylinux1_x86_64.whl
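This will probably fail.
pip rejects the wheel because Ubuntu 18.04 already ships Python 3.6. No problem: change 35 to 36 in the file name and it installs normally. However, TensorFlow will then print a version-mismatch RuntimeWarning every time it is imported.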
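The rename can be scripted. The sketch below copies the wheel under a retagged name rather than renaming it in place; retag_wheel_name is a hypothetical helper written for this post, not part of pip. Note that this only gets past pip's file-name check: the binaries inside are still built for Python 3.5, which is why the warning appears.

```python
import os
import shutil
import sys


def retag_wheel_name(filename, old="35", new=None):
    """Replace the cpXX and cpXXm tags in a wheel file name."""
    if new is None:
        # Default to the running interpreter, e.g. "36" for Python 3.6.
        new = "{}{}".format(sys.version_info.major, sys.version_info.minor)
    return filename.replace("cp" + old, "cp" + new)


if __name__ == "__main__":
    src = os.path.expanduser(
        "~/Downloads/tensorflow-1.8.0-cp35-cp35m-manylinux1_x86_64.whl")
    dst = os.path.join(os.path.dirname(src),
                       retag_wheel_name(os.path.basename(src), new="36"))
    if os.path.exists(src):
        shutil.copyfile(src, dst)  # keep the original file untouched
    print(dst)
```

The printed path is the file to pass to pip3 install.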
Let's test it:
Python 3.6.5 (default, Apr 1 2018, 05:46:30)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-07-22 18:59:14.289004: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-07-22 18:59:14.296182: W tensorflow/stream_executor/rocm/rocm_driver.cc:404] creating context when one is currently active; existing: 0x7fa28910d130
2018-07-22 18:59:14.296312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1451] Found device 0 with properties:
name: Ellesmere [Radeon RX 470/480]
AMDGPU ISA: gfx803
memoryClockRate (GHz) 1.266
pciBusID 0000:09:00.0
Total memory: 8.00GiB
Free memory: 7.75GiB
2018-07-22 18:59:14.296337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1562] Adding visible gpu devices: 0
2018-07-22 18:59:14.296360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-22 18:59:14.296372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:995] 0
2018-07-22 18:59:14.296384: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1008] 0: N
2018-07-22 18:59:14.296429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1124] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7539 MB memory) -> physical GPU (device: 0, name: Ellesmere [Radeon RX 470/480], pci bus id: 0000:09:00.0)
>>> sess.run(hello)
b'Hello, TensorFlow!'
>>> a = tf.constant(10)
>>> b = tf.constant(32)
>>> sess.run(a+b)
42
>>> sess.close()
>>> exit()
Update 2018/9/13
Upgrading to Ubuntu's latest kernel, 4.15.0-34, causes the driver to fail to load, and rocminfo reports an error:
hsa api call failure at line 900, file: /home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.8/rocminfo/rocminfo.cc. Call returned 4104
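The fix is to remove the new kernel and go back to the previous one, 4.15.0-33:
sudo dpkg --get-selections | grep linux # list the installed kernels
sudo apt remove linux-image-4.15.0-34-generic # remove the new kernel
sudo apt install linux-image-4.15.0-33-generic # install the previous kernel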
Update 2018/11/02
ROCm 1.9.1 no longer requires installing the AMD GPU driver.
Tested successfully with kernel 4.15.0-38.
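The kernel results reported in these updates can be turned into a small check to run before or after a kernel upgrade. This is just a sketch of my own: the lists contain only the versions mentioned in this post (4.15.0-33 and 4.15.0-34 with ROCm 1.8, 4.15.0-38 with ROCm 1.9.1), and anything else is reported as untested rather than guessed.

```python
import platform

# Results reported in this post; other kernels are simply untested here.
KNOWN_GOOD = ("4.15.0-33", "4.15.0-38")
KNOWN_BAD = ("4.15.0-34",)


def classify_kernel(release):
    """Classify a kernel release string such as '4.15.0-34-generic'."""
    base = "-".join(release.split("-")[:2])  # drop the '-generic' suffix
    if base in KNOWN_BAD:
        return "known bad"
    if base in KNOWN_GOOD:
        return "known good"
    return "untested"


if __name__ == "__main__":
    release = platform.release()
    print(release, "->", classify_kernel(release))
```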