GPUs differ from CPUs in the following ways:
In the figure above, the green blocks are the compute units (ALUs), the part that actually performs arithmetic; the orange-red blocks are the cache, which holds the data involved in computation; and the orange-yellow blocks are the control units. Because the data structures a GPU works on are relatively simple, the vast majority of its hardware resources go toward computation, whereas the workloads a CPU faces are far more complex, forcing it to spend considerable resources on control and caching instead.
When machine learning and deep learning work on large datasets, a traditional CPU server or workstation forces you to either train out-of-core or train on a subset of the data, and both usually cost model accuracy. Out-of-core training must keep gradient descent well behaved; the traditional remedy is gradient truncation, and the newest is FTRL. Both solve part of the problem, but their flaws are obvious: gradient truncation still falls easily into exploding or vanishing gradients, and FTRL fits relatively few model types (it appeared late and has seen essentially no use in deep learning, though it shows real promise for online learning). Taking a data subset, whether by random sampling or by slicing off a portion, risks throwing away key periodic information. With a GPU you can train on larger datasets and get results sooner, which means faster iteration and better models.
This article uses a GPU cloud instance provided by Alibaba Cloud to walk through setting up a TensorFlow-based deep learning environment on CentOS 7.4.
Install zsh and oh-my-zsh
zsh and oh-my-zsh greatly improve the efficiency of typing commands in a Linux shell, with much stronger auto-completion plus a rich set of plugins and themes.
Install zsh:
1. yum search zsh
2. yum install -y zsh.x86_64
Install oh-my-zsh:
1. yum install git
2. sh -c "$(wget https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh -O -)"
3. chsh -l
4. chsh -s /bin/zsh
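After installation you can pick a theme and enable plugins in ~/.zshrc; a minimal sketch (the theme and plugin names below are just the defaults, swap in whatever you prefer):
export ZSH="$HOME/.oh-my-zsh" #where the install script put oh-my-zsh
ZSH_THEME="robbyrussell" #the default theme
plugins=(git) #add more plugins to this list as needed
source $ZSH/oh-my-zsh.sh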
Install pyenv
pyenv is a Python version manager; with it, you can install different versions of Python and Anaconda and switch between them with ease.
$ git clone https://github.com/yyuu/pyenv.git ~/.pyenv #clone pyenv into your home directory with git
$ echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc #then set the environment variables; we write to ~/.zshrc since we switched to zsh
$ echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
$ echo 'eval "$(pyenv init -)"' >> ~/.zshrc #finally hook in pyenv init
$ exec $SHELL -l #restart the shell so pyenv takes effect
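To confirm that pyenv is installed and on your PATH:
pyenv --version #prints the pyenv version
pyenv versions #lists the Python installations pyenv manages (only "system" at first)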
The pyenv-virtualenv plugin ensures that when you set a per-directory environment with `pyenv local anaconda3-5.1.0`, the terminal displays the correct virtual environment name.
git clone https://github.com/yyuu/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv
echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.zshrc
source ~/.zshrc
Install Anaconda
Anaconda is a prepackaged Python distribution for scientific computing: it bundles Python along with most of the packages scientific computing needs, such as pandas, matplotlib, and statsmodels.
pyenv install --list | grep anaconda #list the installable anaconda versions
pyenv install anaconda3-5.1.0
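After the install finishes, you can pin the new Anaconda to a project directory; a minimal sketch (the directory name is just an example):
mkdir -p ~/work/tf && cd ~/work/tf #hypothetical project directory
pyenv local anaconda3-5.1.0 #writes a .python-version file for this directory
python --version #should now report Anaconda's Python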
Install the NVIDIA driver
First, check whether the machine has a CUDA-capable NVIDIA GPU:
1. lspci | grep -i nvidia
2. 00:08.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
3. 00:09.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
Check the card model and the driver it needs:
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
yum install nvidia-detect
nvidia-detect -v
Download the driver matching your card's model and version from the NVIDIA website.
Disable the built-in nouveau driver:
#create a new config file
sudo vim /etc/modprobe.d/blacklist-nouveau.conf
#sudo vim /usr/lib/modprobe.d/dist-blacklist.conf #if the file above ends up not working, use this one
#add the following lines
blacklist nouveau
options nouveau modeset=0
#save and quit
:wq
#back up the current initramfs image
sudo mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
#build a new image
sudo dracut /boot/initramfs-$(uname -r).img $(uname -r)
#reboot
sudo reboot
#after the reboot, verify with the command below (no output means nouveau is disabled)
lsmod | grep nouveau
#if nouveau still loads, try the following and reboot again:
sudo dracut --force
Install the driver:
chmod +x NVIDIA-Linux-x86_64-390.48.run
sh NVIDIA-Linux-x86_64-390.48.run
You may see the following warning:
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'; these paths were not queryable from the system. If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.
It can be safely ignored. Check that the driver installed successfully:
nvidia-smi
Install CUDA
For CUDA, the runfile installer is recommended; installing online with rpm tends to fail in ways that are hard to fix. On the download page, select your OS, architecture, and distribution, and pick the runfile (local) installer type.
If the download stalls, you can manually switch your DNS to 1.1.1.1, reputedly the fastest public resolver.
The runfile actually bundles three installers: the GPU driver, the CUDA toolkit, and the samples. We only need the last two, so first extract the bundle (the file names below are for CUDA 9.1.85, the version configured later in this article; adjust them to match your download):
./cuda_9.1.85_387.26_linux.run --extract=$HOME #unpack the bundled installers into your home directory
./cuda-linux.9.1.85-23083092.run #run the toolkit installer
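The toolkit installs to /usr/local/cuda by default; you will usually also want nvcc on your PATH and the CUDA libraries on LD_LIBRARY_PATH. A sketch, assuming the default install location and the zsh setup from earlier:
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.zshrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.zshrc
source ~/.zshrc
nvcc --version #confirms the toolkit is reachable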
Then install the samples (the toolkit installer also leaves a copy under /usr/local/cuda/samples, which we build below) and verify the CUDA installation:
./cuda-samples.9.1.85-23083092-linux.run ./software/cuda-sample
cd /usr/local/cuda/samples
make
cd bin/x86_64/linux/release/
./deviceQuery
#a long report is printed, ending with
Result = PASS
#which means the test passed
#finally, run the bandwidth test
./bandwidthTest
#it should also end with Result = PASS
Install cuDNN
Downloading cuDNN requires registering an NVIDIA developer account, which is not covered step by step here; download the cuDNN archive from the NVIDIA website yourself. Since ./configure below uses cuDNN 7.1 for CUDA 9.1, fetch that version, then extract it into /usr/local:
sudo tar -xvf cudnn-9.1-linux-x64-v7.1.tgz -C /usr/local
sudo ldconfig
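To double-check that the headers and libraries landed where CUDA expects them:
ls /usr/local/cuda/include/cudnn.h #header copied from the archive
ls /usr/local/cuda/lib64/libcudnn* #shared libraries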
Install bazel
bazel is Google's build tool. Download the matching bazel release from https://github.com/bazelbuild/bazel/releases and install it, for example as sketched below.
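A sketch of installing a release via its installer script (0.13.0 matches the version reported in the configure log below; adjust to your download):
chmod +x bazel-0.13.0-installer-linux-x86_64.sh
./bazel-0.13.0-installer-linux-x86_64.sh --user #installs into $HOME/bin
export PATH="$PATH:$HOME/bin" #make sure $HOME/bin is on the PATH
bazel version #verify the install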
Compile and install TensorFlow
We install from source, which has the advantage of letting you customize the build: the prebuilt wheel Google ships is a lowest-common-denominator build that includes features many of us never need, such as Google Cloud Platform support, while some genuinely useful features, such as accelerated CPU compute instructions, are disabled by default for compatibility. Building from source lets you make more sensible choices about all of this. The downside, of course, is that compilation takes quite a while.
Clone the TensorFlow source, enter the source directory, and run `pyenv local anaconda3-5.1.0` there, so that TensorFlow gets installed into the intended virtual environment, as sketched below.
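A sketch of fetching the source and pinning the environment (the release branch here is an example; check out whichever version you intend to build):
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git checkout r1.8 #example release branch
pyenv local anaconda3-5.1.0 #so pip later installs into this environment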
Google publishes the build configurations currently verified to compile; check that list and pick the matching gcc and bazel versions.
This article builds and installs TensorFlow with the following script:
#!/usr/bin/env bash

# Detect the platform and read the CPU feature flags
if [ "$(uname)" == "Darwin" ]; then
    # macOS
    raw_cpu_flags=`sysctl -a | grep machdep.cpu.features | cut -d ":" -f 2 | tr '[:upper:]' '[:lower:]'`
elif [ "$(uname)" == "Linux" ]; then
    # GNU/Linux
    raw_cpu_flags=`grep flags -m1 /proc/cpuinfo | cut -d ":" -f 2 | tr '[:upper:]' '[:lower:]'`
else
    echo "Unknown platform: $(uname)"
    exit 1
fi

# Check that a virtualenv is activated, so the wheel lands in the right place
if [ -z "$VIRTUAL_ENV" ]; then
    echo "VirtualEnv is not activated"
    exit 1
fi

VENV_BIN=$VIRTUAL_ENV/bin
VENV_LIB=$VIRTUAL_ENV/lib

# bazel's TensorFlow build needs these env vars
export PYTHON_BIN_PATH=$VENV_BIN/python
export PYTHON_LIB_PATH=$VENV_LIB/`ls $VENV_LIB | grep python | grep -v lib`

# Translate the detected CPU flags into gcc -m options for bazel
COPT="--copt=-march=native"
for cpu_feature in $raw_cpu_flags
do
    case "$cpu_feature" in
        "sse4.1" | "sse4.2" | "ssse3" | "fma" | "cx16" | "popcnt" | "aes")
            COPT+=" --copt=-m$cpu_feature"
            ;;
        "sse4_1" | "sse4_2")
            COPT+=" --copt=-m${cpu_feature/_/.}" # Linux spells these sse4_1/sse4_2; gcc wants -msse4.1/-msse4.2
            ;;
        "avx1.0")
            COPT+=" --copt=-mavx"
            ;;
        *)
            # noop
            ;;
    esac
done

bazel clean
./configure
bazel build -c opt $COPT -k //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install --upgrade /tmp/tensorflow_pkg/`ls /tmp/tensorflow_pkg/ | grep tensorflow`
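Save it in the TensorFlow source directory, say as build_tf.sh (a name chosen here for illustration), and run it from there:
chmod +x build_tf.sh
./build_tf.sh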
The script supports both macOS and Linux. After it starts, ./configure walks you through the following prompts:
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
WARNING: Running Bazel server needs to be killed, because the startup options are different.
You have bazel 0.13.0 installed.
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: n
No jemalloc as malloc support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
No Apache Kafka Platform support will be enabled for TensorFlow.
Do you wish to build TensorFlow with XLA JIT support? [y/N]: y
XLA JIT support will be enabled for TensorFlow.
Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.
Do you wish to build TensorFlow with VERBS support? [y/N]: n
No VERBS support will be enabled for TensorFlow.
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.
Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 9.1
Please specify the location where CUDA 9.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 7.1
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Do you wish to build TensorFlow with TensorRT support? [y/N]: n
No TensorRT support will be enabled for TensorFlow.
Please specify the NCCL version you want to use. [Leave empty to default to NCCL 1.3]:
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.0,6.0]
Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Not configuring the WORKSPACE for Android builds.
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
Configuration finished
Starting local Bazel server and connecting to it...
The key options to enable above are:
1. XLA JIT #enables the accelerated linear algebra JIT compiler
2. CUDA #enter version 9.1
3. CuDNN #enter version 7.1
4. #answer the remaining prompts as shown in the configure log above
Run the following code; if a GPU shows up in the log, the installation succeeded:
import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
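Alternatively, a quick one-line check from the shell (tf.test.is_gpu_available is part of the TF 1.x API):
python -c 'import tensorflow as tf; print(tf.test.is_gpu_available())' #prints True when a GPU can be used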
The device placement log should list your GPU, which confirms the installation succeeded.
Install Keras
Keras is a more abstract layer on top of lower-level frameworks: it uses the concept of layers to provide many of the basic building blocks of deep learning, such as Embedding, Pooling, Dropout, and Dense, and it delegates computation to a pluggable backend (TensorFlow by default; Theano and CNTK are the other official options), which dramatically lowers the coding effort of deep learning.
pip install keras
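As a quick smoke test, importing keras also reports which backend it picked up:
python -c 'import keras; print(keras.__version__)' #should print "Using TensorFlow backend." followed by the version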
All set. Enjoy your machine learning journey!