TVM, in its official description, is an end-to-end IR (intermediate representation) stack for deploying deep-learning workloads onto hardware. Put another way, it is an end-to-end solution for shipping deep-learning models to a wide range of hardware devices; for more details, see the TVM homepage.
When deploying CNNs on edge devices, earlier frameworks such as NCNN, Tengine and Feather mostly relied on hand-tuned operator (Op) implementations to squeeze the most out of the hardware. TVM takes a different route: the framework itself searches for optimal or near-optimal operator implementations, and frameworks of this kind are likely to become the mainstream going forward.
If you have read my earlier posts, you may recall that deploying a CNN on an embedded device usually involves the following steps: model design, model training, model pruning, and on-device deployment. We will not go through each of them again here; this article focuses on on-device deployment. For the first three steps of Mobilenet-SSD, see the earlier articles in this column.
How do we deploy with TVM? Abstractly, there are two steps: 1. generate the end-to-end IR stack; 2. run inference with the TVM runtime. The first step has by far the largest impact on performance, because it is where the optimal or near-optimal operator implementations are chosen. The official TVM documentation offers two strategies: either use the configurations TVM has already tuned for your platform, or run AutoTVM, which searches for an optimal or near-optimal representation of the forward pass on the target platform. For an Android target, the AutoTVM route first requires an RPC link between the device and the host machine; a rough sketch of that setup is shown below. In this article we stay with the first strategy and use TVM's pre-tuned rk3399 configuration.
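The following is only a minimal sketch of the RPC side, under a few assumptions that are not part of this article's actual workflow: the TVM RPC app is running on the phone and has registered with a tracker started via python -m tvm.exec.rpc_tracker, and the tracker address 0.0.0.0:9190 and the device key 'android' are placeholders for your own setup. It uses the cross-compiled deploy_lib.so that we produce further down.
# Sketch only: establish the host <-> device RPC link that AutoTVM relies on.
from tvm import rpc

# tracker address/port and device key are placeholders (assumptions)
tracker = rpc.connect_tracker('0.0.0.0', 9190)
remote = tracker.request('android', priority=1, session_timeout=60)

# once a session exists, cross-compiled modules can be pushed to the phone
# and executed there, which is how AutoTVM measures candidate schedules
remote.upload('./tvm_android/tvm_lib/deploy_lib.so')
rlib = remote.load_module('deploy_lib.so')
print(remote.cpu(0))  # remote device context used for on-device measurements
With that noted, let's begin the Mobilenet-SSD deployment.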
# tvm, relay
import tvm
from tvm import relay
# os and numpy
import numpy as np
import os.path
# Tensorflow imports
import tensorflow as tf
# Tensorflow utility functions
import tvm.relay.testing.tf as tf_testing
from PIL import Image
from tvm.contrib import util, ndk, graph_runtime

# frozen TFLite-exported SSD graph and target configuration
model_name = "./tflite_graph.pb"
arch = 'arm64'
# target_host = 'llvm -target=%s-linux-android' % arch
# target = 'opencl'
# target = tvm.target.mali(model='rk3399')
# target_host = tvm.target.arm_cpu(model='rk3399')
target_host = None
target = tvm.target.arm_cpu(model='rk3399')
# target_host = None
# target = 'llvm -target=%s-linux-android' % arch
layout = 'NCHW'

# load the frozen TensorFlow graph
with tf.gfile.FastGFile(model_name, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    graph = tf.import_graph_def(graph_def, name='')
    # add shapes / strip attributes the TVM frontend cannot handle
    graph_def = tf_testing.ProcessGraphDefParam(graph_def)

shape_dict = {'normalized_input_image_tensor': (1, 300, 300, 3)}
dtype_dict = {'normalized_input_image_tensor': 'float32'}

# import the TensorFlow graph into Relay
mod, params = relay.frontend.from_tensorflow(
    graph_def,
    layout=layout,
    shape=shape_dict,
    outputs=['raw_outputs/box_encodings', 'raw_outputs/class_predictions']
)

# compile the Relay module for the rk3399 CPU
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(
        mod[mod.entry_func],
        target=target,
        target_host=target_host,
        params=params
    )

# export the three deployment artifacts, cross-compiling the shared
# library with the Android NDK toolchain
fcompile = ndk.create_shared
lib.export_library("./tvm_android/tvm_lib/deploy_lib.so", fcompile)
with open("./tvm_android/tvm_lib/deploy_graph.json", "w") as fo:
    fo.write(graph)
with open("./tvm_android/tvm_lib/deploy_param.params", "wb") as fo:
    fo.write(relay.save_param_dict(params))
Looking at the code above: since the hardware platform is an RK3399, we use the schedule configuration TVM ships for it, tvm.target.arm_cpu(model='rk3399'). The two output nodes are 'raw_outputs/box_encodings' and 'raw_outputs/class_predictions'. Because the deployment target is Android, remember to pass ndk.create_shared as the fcompile callback when exporting the library, so the .so is built with the NDK toolchain. When compilation finishes we obtain three files: deploy_lib.so, deploy_graph.json and deploy_param.params.
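Before moving to the device, it can be worth a quick host-side sanity check that the converted Relay module runs end to end. The cross-compiled arm64 .so cannot be loaded on the development machine, so this minimal sketch (optional, and not part of the deployment itself) rebuilds the same mod and params for a local 'llvm' target and pushes a random NHWC input through the graph runtime; it reuses the names from the script above.
# Optional host-side sanity check (sketch): rebuild for the local CPU so the
# graph can be executed on the development machine before deploying.
with relay.build_config(opt_level=3):
    host_graph, host_lib, host_params = relay.build(
        mod[mod.entry_func], target='llvm', params=params)

ctx = tvm.cpu(0)
m = graph_runtime.create(host_graph, host_lib, ctx)
m.load_params(relay.save_param_dict(host_params))

# random NHWC input matching shape_dict, normalised to [-1, 1] as the model expects
dummy = np.random.uniform(-1.0, 1.0, size=(1, 300, 300, 3)).astype('float32')
m.set_input('normalized_input_image_tensor', tvm.nd.array(dummy))
m.run()
print(m.get_output(0).shape, m.get_output(1).shape)
If the two raw output tensors come back with sensible shapes, we can turn to the device side. Next we integrate the TVM runtime into an Android NDK project; the Android.mk below compiles a minimal set of runtime sources together with our own code.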
LOCAL_PATH := $(call my-dir)
OpenCV_BASE = /Users/xindongzhang/armnn-tflite/OpenCV-android-sdk/
TVM_BASE = /Users/xindongzhang/tvm/
include $(CLEAR_VARS)
LOCAL_MODULE := OpenCL
LOCAL_SRC_FILES := /Users/xindongzhang/Desktop/tvm-ssd/tvm_android/jni/libOpenCL.so
include $(PREBUILT_SHARED_LIBRARY)
include $(CLEAR_VARS)
OpenCV_INSTALL_MODULES := on
OPENCV_LIB_TYPE := STATIC
include $(OpenCV_BASE)/sdk/native/jni/OpenCV.mk
LOCAL_MODULE := tvm_mssd
LOCAL_C_INCLUDES += $(OPENCV_INCLUDE_DIR)
LOCAL_C_INCLUDES += $(TVM_BASE)/include
LOCAL_C_INCLUDES += $(TVM_BASE)/3rdparty/dlpack/include
LOCAL_C_INCLUDES += $(TVM_BASE)/3rdparty/dmlc-core/include
LOCAL_C_INCLUDES += $(TVM_BASE)/3rdparty/HalideIR/src
LOCAL_C_INCLUDES += $(TVM_BASE)/topi/include
LOCAL_SRC_FILES := \
main.cpp \
$(TVM_BASE)/src/runtime/c_runtime_api.cc \
$(TVM_BASE)/src/runtime/cpu_device_api.cc \
$(TVM_BASE)/src/runtime/workspace_pool.cc \
$(TVM_BASE)/src/runtime/module_util.cc \
$(TVM_BASE)/src/runtime/system_lib_module.cc \
$(TVM_BASE)/src/runtime/module.cc \
$(TVM_BASE)/src/runtime/registry.cc \
$(TVM_BASE)/src/runtime/file_util.cc \
$(TVM_BASE)/src/runtime/dso_module.cc \
$(TVM_BASE)/src/runtime/thread_pool.cc \
$(TVM_BASE)/src/runtime/threading_backend.cc \
$(TVM_BASE)/src/runtime/ndarray.cc \
$(TVM_BASE)/src/runtime/graph/graph_runtime.cc \
$(TVM_BASE)/src/runtime/opencl/opencl_device_api.cc \
$(TVM_BASE)/src/runtime/opencl/opencl_module.cc
LOCAL_LDLIBS := -landroid -llog -ldl -lz -fuse-ld=gold
LOCAL_CFLAGS := -O2 -fvisibility=hidden -fomit-frame-pointer -fstrict-aliasing \
-ffunction-sections -fdata-sections -ffast-math -ftree-vectorize \
-fPIC -Ofast -ffast-math -w -std=c++14
LOCAL_CPPFLAGS := -O2 -fvisibility=hidden -fvisibility-inlines-hidden \
-fomit-frame-pointer -fstrict-aliasing -ffunction-sections \
-fdata-sections -ffast-math -fPIC -Ofast -ffast-math -std=c++14
LOCAL_LDFLAGS += -Wl,--gc-sections
LOCAL_CFLAGS += -fopenmp
LOCAL_CPPFLAGS += -fopenmp
LOCAL_LDFLAGS += -fopenmp
LOCAL_ARM_NEON := true
APP_ALLOW_MISSING_DEPS = true
LOCAL_SHARED_LIBRARIES := \
OpenCL
include $(BUILD_EXECUTABLE)
With the runtime integrated, we can write the C++ business code. Here we only cover inference and skip postprocessing; for postprocessing, see the other articles in this column. The code is fairly simple, so I will just paste it below.
#include "tvm/runtime/c_runtime_api.h"
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
int main(void)
{
    // load the graph json and serialized parameters produced by relay.build
    std::ifstream graph_file("./deploy_graph.json");
    std::ifstream model_file("./deploy_param.params", std::ios::binary);
    std::string graph_content(
        (std::istreambuf_iterator<char>(graph_file)),
        std::istreambuf_iterator<char>()
    );
    std::string model_params(
        (std::istreambuf_iterator<char>(model_file)),
        std::istreambuf_iterator<char>()
    );

    // the cross-compiled operator library exported from the Python script
    tvm::runtime::Module mod_dylib = tvm::runtime::Module::LoadFromFile("./deploy_lib.so");
    // system-lib handle (not used below; only needed for the bundled system-lib flow)
    tvm::runtime::Module mod_syslib = (*tvm::runtime::Registry::Get("module._GetSystemLib"))();
    // the device type must match the target used in relay.build; we compiled
    // for the rk3399 CPU above, so run on the CPU here
    int device_type = kDLCPU;
    int device_id = 0;
    // int device_type = kDLOpenCL;
    // int device_id = 0;
    // create the graph runtime bound to the operator library and the device
    tvm::runtime::Module mod =
        (*tvm::runtime::Registry::Get("tvm.graph_runtime.create"))
        (graph_content.c_str(), mod_dylib, device_type, device_id);

    // preprocessing: resize to 300x300 and normalise to [-1, 1]
    int INPUT_SIZE = 300;
    cv::Mat raw_image = cv::imread("./body.jpg");
    int raw_image_height = raw_image.rows;
    int raw_image_width = raw_image.cols;
    cv::Mat image;
    cv::resize(raw_image, image, cv::Size(INPUT_SIZE, INPUT_SIZE));
    image.convertTo(image, CV_32FC3);
    image = (image * 2.0f / 255.0f) - 1.0f;

    // load the network weights into the graph runtime
    TVMByteArray params;
    params.data = reinterpret_cast<const char*>(model_params.c_str());
    params.size = model_params.size();
    mod.GetFunction("load_params")(params);

    // wrap the preprocessed image in a DLTensor
    std::vector<int64_t> input_shape = {1, 3, 300, 300};
    DLTensor input;
    input.ctx = DLContext{kDLCPU, 0};
    // input.ctx = DLContext{kDLOpenCL, 0};
    input.data = image.data;
    input.ndim = 4;
    input.dtype = DLDataType{kDLFloat, 32, 1};
    input.shape = input_shape.data();
    input.strides = nullptr;
    input.byte_offset = 0;

    // warm up
    for (int i = 0; i < 3; ++i) {
        mod.GetFunction("set_input")("normalized_input_image_tensor", &input);
        mod.GetFunction("run")();
    }

    // time N forward passes; the raw outputs could be fetched afterwards
    // with mod.GetFunction("get_output") if postprocessing were needed
    int N = 10;
    std::clock_t start = std::clock();
    // mod.GetFunction("set_input")("normalized_input_image_tensor", &input);
    for (int i = 0; i < N; ++i) {
        mod.GetFunction("set_input")("normalized_input_image_tensor", &input);
        mod.GetFunction("run")();
    }
    std::clock_t end = std::clock();
    double duration = (end - start) * (1.0 / (double) N) / (double) CLOCKS_PER_SEC;
    std::cout << duration << std::endl;
    return 0;
}
With that, we have completed the deployment of Mobilenet-SSD with TVM on rk3399-android.
Wrapping up
This article looked at how to deploy a CNN on the RK3399 with TVM, using a Mobilenet SSD detector as the example.
References
Accelerating the android-tflite Mobilenet classifier with NNAPI
A detailed walkthrough of deploying tf-MobilenetSSD with MNN in C++
Hands-on MNN: deploying Mobilenet SSD (with source code)