For most machine-learning enthusiasts, TensorFlow (TF from here on) is an excellent open-source Python machine-learning framework. Some developers, however, routinely train their models in Python and then deploy them in a C++ environment, often using Docker for the deployment and test setup.
This post addresses that need. Drawing on the pitfalls I hit and the nights I lost along the way, it documents how to build a Docker image with a TensorFlow C++ environment.
In my experience, the more CPU threads a machine has, the faster the build; if you can, compiling on a machine with many cores and threads will save some time.
The TF C++ API can currently only be built from source, so first pull the repo:
# Clone TensorFlow from GitHub into a tensorflow_src folder
$ git clone https://github.com/tensorflow/tensorflow.git tensorflow_src
# Check out the version you need; r1.14 is used as the example here. Other versions may require matching adjustments to your CUDA, cuDNN, NCCL, and Bazel versions
$ cd tensorflow_src
$ git checkout r1.14
Next comes the build configuration.
$ ./configure
You have bazel 0.26.1 installed.
# Specify the Python environment: either the system Python or the Python executable of a virtualenv/conda environment
Please specify the location of python. [Default is /usr/bin/python]:
# Specify where Python packages are installed
Please input the desired Python library path to use. Default is [/usr/local/lib/python3.5/site-packages]
# Enable XLA (Accelerated Linear Algebra) JIT support? The default is Yes
Do you wish to build TensorFlow with XLA JIT support? [Y/n]:
XLA JIT support will be enabled for TensorFlow.
# Enable OpenCL SYCL support? The default is No
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.
# Enable ROCm support? The default is No
Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.
# Enable CUDA support? The default is No, but answer Yes here since we need the GPU
Do you wish to build TensorFlow with CUDA support? [y/N]: Y
CUDA support will be enabled for TensorFlow.
# Enable TensorRT support? The default is No
Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.
# If cuDNN is not installed in the same location as CUDA, you must also provide its install path: enter a comma-separated list of paths in which the cuDNN headers and shared libraries can be found.
Found CUDA 10.0 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Found cuDNN 7 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 7.5]: # Specify the CUDA compute capabilities; the default is fine
Do you want to use clang as CUDA compiler? [y/N]: # Use clang for CUDA compilation? The default is No
nvcc will be used as CUDA compiler.
# Specify the gcc to use; the default gcc is fine
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
# Enable MPI support? The default is No
Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.
# Specify optimization flags for the bazel build; the defaults are fine, and more can be added later if needed
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:
# Android is not needed; the default is No
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details. # These can be added as flags on the bazel build command line
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=gdr # Build with GDR support.
--config=verbs # Build with libverbs support.
--config=ngraph # Build with Intel nGraph support.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=noignite # Disable Apache Ignite support.
--config=nokafka # Disable Apache Kafka support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished
With the configuration done, start the actual build:
$ bazel build --config=opt --config=cuda --action_env="LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" //tensorflow:libtensorflow_cc.so
# Adding --config=monolithic makes the build produce only the libtensorflow_cc.so library (fewer libraries to manage, simpler linking); without it, a libtensorflow_framework.so library is produced as well. The pros and cons of this option come up again under possible problems below
The build can take quite a while. From my own builds: roughly half an hour on a 10-core/20-thread CPU and about 50 minutes on a 6-core/12-thread CPU, for reference. Feel free to go grab a meal in the meantime.
Once the build finishes, copy the headers and libraries to a suitable location.
# Using /usr/local/include/tf as the root directory, for example
$ sudo mkdir -p /usr/local/include/tf/tensorflow
$ sudo cp -r bazel-genfiles/ /usr/local/include/tf
$ sudo cp -r tensorflow/cc /usr/local/include/tf/tensorflow
$ sudo cp -r tensorflow/core /usr/local/include/tf/tensorflow
$ sudo cp -r third_party /usr/local/include/tf
$ sudo cp -r bazel-bin/tensorflow/libtensorflow* /usr/local/lib # Put the shared libraries separately under /usr/local/lib
# The resulting directory structure should look like this:
# _/usr/local/include/
# |_tf/
# |_tensorflow/
# | |_cc/
# | |_core/
# |_bazel-genfiles/
# |_third_party/
Finally, some post-installation work: adding the third-party dependencies TF needs. The third_party folder we just copied does not contain all of them. TF ships a script that handles exactly this.
# From the tensorflow_src directory
# Run the script that downloads the third-party dependencies; this creates a downloads folder full of them
$ ./tensorflow/contrib/makefile/download_dependencies.sh
# Move everything under downloads to /usr/local/include, or any other location you find suitable.
# I put them all under /usr/local/include so that my own projects need fewer include paths.
$ sudo cp -r tensorflow/contrib/makefile/downloads/* /usr/local/include
# TF r1.14's proto files require protobuf 3.7.1, so check out v3.7.1 in this third-party copy
$ cd /usr/local/include/protobuf
$ git checkout v3.7.1
# If libtensorflow_framework.so does not exist under /usr/local/lib, symlink it to libtensorflow_framework.so.1
$ cd /usr/local/lib
$ sudo ln -s libtensorflow_framework.so.1 libtensorflow_framework.so
At this point the manual TF installation is complete.
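As a quick smoke test of the installation, the classic "create and close an empty session" program can be used. This is a minimal sketch; the file name main.cc and the compile flags below are illustrative, and it only exercises linking, not any model:

```cpp
// main.cc -- minimal smoke test for the TF C++ installation.
// Creates an empty session and reports the resulting status.
#include <iostream>

#include "tensorflow/core/public/session.h"

int main() {
  tensorflow::Session* session = nullptr;
  tensorflow::Status status =
      tensorflow::NewSession(tensorflow::SessionOptions(), &session);
  if (!status.ok()) {
    std::cerr << "NewSession failed: " << status.ToString() << std::endl;
    return 1;
  }
  std::cout << "TensorFlow session created successfully" << std::endl;
  session->Close();
  delete session;
  return 0;
}
```

With the layout above, a compile command along these lines should work (paths match the copy steps from earlier):
g++ -std=c++11 main.cc -I/usr/local/include/tf -I/usr/local/include/tf/bazel-genfiles -I/usr/local/include/eigen -I/usr/local/include/absl -I/usr/local/include/protobuf/src -L/usr/local/lib -ltensorflow_cc -ltensorflow_framework -o tf_smoke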
The basic idea: when writing the Dockerfile, turn the installation steps above into a bash script and execute that script with a RUN command in the Dockerfile. To skip the interactive prompts of the configure step, we define the environment variables it reads at the top of the bash script:
# Clone the TF repo and check out r1.14, as above; omitted here
# Python path options
# Python path options
export PYTHON_BIN_PATH=$(which python3)
export PYTHON_LIB_PATH="$($PYTHON_BIN_PATH -c 'import site; print(site.getsitepackages()[0])')"
# Compilation parameters
export TF_NEED_CUDA=1
export TF_NEED_GCP=0
export TF_CUDA_COMPUTE_CAPABILITIES=5.2,6.1,7.0,7.5
export TF_NEED_HDFS=0
export TF_NEED_OPENCL=0
export TF_NEED_JEMALLOC=0
export TF_ENABLE_XLA=1
export TF_NEED_VERBS=0
export TF_CUDA_CLANG=0
export TF_DOWNLOAD_CLANG=0
export TF_NEED_MKL=0
export TF_DOWNLOAD_MKL=0
export TF_NEED_MPI=0
export TF_NEED_S3=0
export TF_NEED_KAFKA=0
export TF_NEED_GDR=0
export TF_NEED_OPENCL_SYCL=0
export TF_SET_ANDROID_WORKSPACE=0
export TF_NEED_AWS=0
export TF_NEED_IGNITE=0
export TF_NEED_ROCM=0
# Compiler parameters
export GCC_HOST_COMPILER_PATH=$(which gcc)
# Optimization flags for the bazel build
export CC_OPT_FLAGS="-march=native"
# CUDA and cuDNN parameters
export CUDA_TOOLKIT_PATH=$CUDA_HOME
export CUDNN_INSTALL_PATH="/usr/include,/usr/lib/x86_64-linux-gnu" # A comma-separated list of paths covering the cuDNN headers and .so files, needed when cuDNN is installed separately from CUDA
export TF_CUDA_VERSION=10.0 # CUDA version
export TF_CUDNN_VERSION=7.6 # cuDNN version; plain 7 works too, or a full version such as 7.6.1 if you know it
export TF_NEED_TENSORRT=0
export TF_NCCL_VERSION=2.4 # NCCL version, specified the same way as the cuDNN version
export NCCL_INSTALL_PATH=$CUDA_HOME
# Those two lines are important for the linking step.
export LD_LIBRARY_PATH="$CUDA_TOOLKIT_PATH/lib64:${LD_LIBRARY_PATH}"
ldconfig
./configure
bazel build --config=opt --config=cuda --action_env="LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" //tensorflow:libtensorflow_cc.so
# Append the copy and third-party download steps from the first part after this; omitted here
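With the script in place, the Dockerfile itself stays short. A minimal sketch, assuming the script above is saved as build_tf_cc.sh next to the Dockerfile and that the base image already ships CUDA 10.0 and cuDNN 7 (the nvidia/cuda tag below is one such image); installing Bazel 0.26.1 is left as an extra step for your distribution:

```dockerfile
# Base image with CUDA 10.0 + cuDNN 7 preinstalled
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04

# Build prerequisites; Bazel 0.26.1 must also be installed here
RUN apt-get update && apt-get install -y \
    git python3 python3-dev python3-pip

# The script collects the clone, configure, bazel build,
# and copy steps described above
COPY build_tf_cc.sh /tmp/build_tf_cc.sh
RUN chmod +x /tmp/build_tf_cc.sh && /tmp/build_tf_cc.sh
```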
Here is a good ready-made GitHub repo for testing the installation: https://github.com/lysukhin/tensorflow-object-detection-cpp. Its code reads webcam frames and detects hands in them; it requires C++ OpenCV. Even if you don't have OpenCV, the CMakeLists.txt in the repo is worth a close look.
$ git clone https://github.com/lysukhin/tensorflow-object-detection-cpp tf_test # clone into a folder renamed tf_test
$ cd tf_test && mkdir build
Next we modify some paths in CMakeLists.txt to point at the headers we need:
# The new CMakeLists.txt; replace the original file with it
cmake_minimum_required(VERSION 3.7)
project(tf_detector_example)
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -g -Wall")
# Use this one if you want to do video real time detection
set(SOURCE_FILES do_infer.cc)
add_executable(tf_detector_example ${SOURCE_FILES})
# OpenCV libs
find_package(OpenCV REQUIRED)
include_directories(${OpenCV_INCLUDE_DIRS})
target_link_libraries(tf_detector_example ${OpenCV_LIBS})
# ==================== PATHS TO SPECIFY! ==================== #
# Paths to the third-party libraries TF needs. If eigen, absl, and protobuf are not listed explicitly, you may get "no such file or directory" errors
include_directories("/usr/local/include")
include_directories("/usr/local/include/eigen")
include_directories("/usr/local/include/absl")
include_directories("/usr/local/include/protobuf/src")
# TensorFlow header paths
include_directories("/usr/local/include/tf/")
include_directories("/usr/local/include/tf/bazel-genfiles/")
include_directories("/usr/local/include/tf/tensorflow/")
include_directories("/usr/local/include/tf/third_party/")
# TensorFlow shared library paths
target_link_libraries(tf_detector_example "/usr/local/lib/libtensorflow_cc.so")
target_link_libraries(tf_detector_example "/usr/local/lib/libtensorflow_framework.so")
With that done, we can build and run:
$ cd build && cmake ..
$ make
$ ./tf_detector_example
If everything links and runs correctly, congratulations: you're all set.
A common error when compiling your own project against the installed headers:
This file was generated by an older version of protoc which is...
or: This file was generated by a newer version of protoc which is...
These files are generated by protobuf. Open them and check the protobuf version shown in the macro at the top of the file; if the number reads 3007001, that corresponds to version 3.7.1. Then simply check out the matching version in the protobuf/ folder.
Another class of errors is caused by having added --config=monolithic to the bazel build. Removing the flag and rebuilding resolves them, but it has also been reported that building without the flag conflicts with OpenCV; see https://github.com/tensorflow/tensorflow/issues/14826. My own current build omits the flag and has shown no problems so far, though that remains to be seen. August 13, 2019
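You don't have to memorize the macro encoding: protobuf packs the version into a single integer as major*1,000,000 + minor*1,000 + patch, so a small shell snippet (with the macro value from the generated header pasted in by hand) recovers the human-readable version:

```shell
# Decode a protobuf version macro value into major.minor.patch.
# 3007001 is the value reported for protobuf 3.7.1.
v=3007001
major=$((v / 1000000))
minor=$((v / 1000 % 1000))
patch=$((v % 1000))
echo "${major}.${minor}.${patch}"   # prints 3.7.1
```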