In the previous chapter's tutorial[1], we used the tvmc command-line tool to optimize and tune a pretrained vision model, ResNet50 v2. TVM also provides a set of Python APIs, which offer far more flexibility for optimizing deep learning models.
In this chapter we keep the same setting as the tvmc tutorial, but do the work through the Python API rather than the TVMC tool: importing, compiling, running, and finally tuning the model.
The goal of this chapter is to get a general feel for TVM's use cases and to learn how to compile and optimize a model with TVM's Python API.
Scripting with the Python API starts by importing the necessary libraries, such as onnx, numpy, and so on:
import onnx
from tvm.contrib.download import download_testdata
from PIL import Image
import numpy as np
import tvm.relay as relay
import tvm
from tvm.contrib import graph_executor
In this chapter we again use ResNet50 v2, an image classification model with 50 convolutional layers. It was pretrained on more than one million images across 1000 classes and expects 224x224 input images. ResNet50 was already introduced earlier, so we won't repeat the details here.
TVM wraps a small library for downloading pretrained models: given the model's URL, format, and similar information, its API downloads the model and saves it locally.
model_url = "".join(
[
"https://github.com/onnx/models/raw/",
"master/vision/classification/resnet/model/",
"resnet50-v2-7.onnx",
]
)
model_path = download_testdata(model_url, "resnet50-v2-7.onnx", module="onnx")
onnx_model = onnx.load(model_path)
As before, we drag our cute kitten out for classification again (writing this, I miss our cat Timi).
The download and preprocessing code follows:
img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
img_path = download_testdata(img_url, "imagenet_cat.png", module="data")
# Resize it to 224x224
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")
# Our input image is in HWC layout while ONNX expects CHW input, so convert the array
img_data = np.transpose(img_data, (2, 0, 1))
# Normalize according to the ImageNet input specification
imagenet_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
imagenet_stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
norm_img_data = (img_data / 255 - imagenet_mean) / imagenet_stddev
# Add the batch dimension, as we are expecting 4-dimensional input: NCHW.
img_data = np.expand_dims(norm_img_data, axis=0)
Next we can compile the ResNet50 model. First we import it through Relay's ONNX frontend, then build it with the standard optimization pipeline, which finally produces a TVM graph executor module:
target = "llvm"
# The input name may vary across model types. You can use a tool
# like Netron to check input names
input_name = "data"
shape_dict = {input_name: img_data.shape}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))
Specifying accurate platform information in the target yields better performance, because TVM can exploit the platform's hardware features in its optimization strategy. For example, target = "llvm -mcpu=skylake" or target = "llvm -mcpu=skylake-avx512" allows TVM to use the x86 AVX-512 instructions.
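The target can also be built programmatically. A minimal sketch, assuming the -mcpu value matches your hardware (skylake-avx512 here is only an example):
# Hedged sketch: rebuild with a CPU-specific target object. The -mcpu value
# is an example; it must match your actual CPU to bring any benefit.
specific_target = tvm.target.Target("llvm -mcpu=skylake-avx512")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=specific_target, params=params)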
Once the model is compiled, we can run inference with the TVM runtime module:
dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()
I wonder whether replacing this with float16 would also work.
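Relay does ship a mixed-precision conversion pass, so the float16 experiment is at least plausible. A hedged sketch, assuming a TVM build recent enough to include ToMixedPrecision (I haven't measured the accuracy or speed impact for this model):
# Hypothetical sketch: convert the Relay module to float16 before building.
# InferType must run first so the conversion pass can see tensor types.
fp16_pass = tvm.transform.Sequential(
    [
        relay.transform.InferType(),
        relay.transform.ToMixedPrecision(mixed_precision_type="float16"),
    ]
)
with tvm.transform.PassContext(opt_level=3):
    fp16_mod = fp16_pass(mod)
    lib_fp16 = relay.build(fp16_mod, target=target, params=params)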
Before tuning the model, let's take a baseline performance measurement. To make the measurement reliable, we run the model several times and compute the average runtime.
import timeit
timing_number = 10
timing_repeat = 10
unoptimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
unoptimized = {
    "mean": np.mean(unoptimized),
    "median": np.median(unoptimized),
    "std": np.std(unoptimized),
}
print(unoptimized)
On my machine, this took:
{'mean': 22.72307151928544, 'median': 22.025499097071588, 'std': 1.3807440805647897}
As before, we post-process the model's output using the label data from the model zoo:
from scipy.special import softmax
# Download a list of labels
labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
labels_path = download_testdata(labels_url, "synset.txt", module="data")
with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]
# Open the output and read the output tensor
scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
This gives the following result:
class='n02123045 tabby, tabby cat' with probability=0.610552
class='n02123159 tiger cat' with probability=0.367179
class='n02124075 Egyptian cat' with probability=0.019365
class='n02129604 tiger, Panthera tigris' with probability=0.001273
class='n04040759 radiator' with probability=0.000261
As in the previous chapter, we can tune the model with the AutoTVM module; the difference is that this time we drive the tuning from the Python API.
import tvm.auto_scheduler as auto_scheduler
from tvm.autotvm.tuner import XGBTuner
from tvm import autotvm
number = 10
repeat = 1
min_repeat_ms = 0 # since we're tuning on a CPU, can be set to 0
timeout = 10 # in seconds
# create a TVM runner
runner = autotvm.LocalRunner(
    number=number,
    repeat=repeat,
    timeout=timeout,
    min_repeat_ms=min_repeat_ms,
    enable_cpu_cache_flush=True,
)
tuning_option = {
    "tuner": "xgb",
    "trials": 10,
    "early_stopping": 100,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="default"), runner=runner
    ),
    "tuning_records": "resnet-50-v2-autotuning.json",
}
# begin by extracting the tasks from the onnx model
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)
# Tune the extracted tasks sequentially.
for i, task in enumerate(tasks):
    prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
    tuner_obj = XGBTuner(task, loss_type="rank")
    tuner_obj.tune(
        n_trial=min(tuning_option["trials"], len(task.config_space)),
        early_stopping=tuning_option["early_stopping"],
        measure_option=tuning_option["measure_option"],
        callbacks=[
            autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
            autotvm.callback.log_to_file(tuning_option["tuning_records"]),
        ],
    )
As tuning runs, we can watch the results improve step by step:
[Task 1/25] Current/Best: 165.26/ 216.09 GFLOPS | Progress: (10/10) | 4.50 s Done.
[Task 2/25] Current/Best: 150.98/ 192.87 GFLOPS | Progress: (10/10) | 3.50 s Done.
[Task 3/25] Current/Best: 168.08/ 249.73 GFLOPS | Progress: (10/10) | 5.44 s Done.
[Task 4/25] Current/Best: 95.23/ 196.60 GFLOPS | Progress: (10/10) | 7.79 s Done.
[Task 5/25] Current/Best: 207.23/ 262.81 GFLOPS | Progress: (10/10) | 4.27 s Done.
[Task 6/25] Current/Best: 132.89/ 550.37 GFLOPS | Progress: (10/10) | 7.08 s Done.
[Task 7/25] Current/Best: 261.82/ 284.37 GFLOPS | Progress: (10/10) | 3.83 s Done.
[Task 8/25] Current/Best: 257.41/ 433.27 GFLOPS | Progress: (10/10) | 3.97 s Done.
[Task 9/25] Current/Best: 176.71/ 211.27 GFLOPS | Progress: (10/10) | 10.51 s Done.
[Task 10/25] Current/Best: 128.45/ 311.06 GFLOPS | Progress: (10/10) | 3.42 s Done.
[Task 11/25] Current/Best: 211.18/ 284.83 GFLOPS | Progress: (10/10) | 3.99 s Done.
[Task 12/25] Current/Best: 165.26/ 325.64 GFLOPS | Progress: (10/10) | 9.99 s Done.
[Task 13/25] Current/Best: 261.55/ 328.09 GFLOPS | Progress: (10/10) | 5.42 s Done.
[Task 14/25] Current/Best: 242.21/ 289.98 GFLOPS | Progress: (10/10) | 9.33 s Done.
[Task 15/25] Current/Best: 231.47/ 241.25 GFLOPS | Progress: (10/10) | 9.91 s Done.
[Task 16/25] Current/Best: 271.65/ 271.65 GFLOPS | Progress: (10/10) | 3.84 s Done.
[Task 17/25] Current/Best: 245.57/ 245.57 GFLOPS | Progress: (10/10) | 4.32 s Done.
[Task 18/25] Current/Best: 292.00/ 381.25 GFLOPS | Progress: (10/10) | 4.28 s Done.
[Task 19/25] Current/Best: 79.44/ 441.70 GFLOPS | Progress: (10/10) | 4.42 s Done.
[Task 20/25] Current/Best: 516.32/ 541.19 GFLOPS | Progress: (10/10) | 13.30 s Done.
[Task 21/25] Current/Best: 414.80/ 449.63 GFLOPS | Progress: (10/10) | 4.99 s Done.
[Task 22/25] Current/Best: 16.18/ 490.37 GFLOPS | Progress: (10/10) | 5.87 s Done.
[Task 23/25] Current/Best: 443.06/ 573.66 GFLOPS | Progress: (10/10) | 4.92 s Done.
[Task 24/25] Current/Best: 5.43/ 80.01 GFLOPS | Progress: (10/10) | 13.03 s Done.
[Task 25/25] Current/Best: 19.45/ 24.78 GFLOPS | Progress: (10/10) | 13.21 s Done.
With the tuned parameters in hand, we recompile the model from the tuning records, then run it again and check the output:
with autotvm.apply_history_best(tuning_option["tuning_records"]):
    with tvm.transform.PassContext(opt_level=3, config={}):
        lib = relay.build(mod, target=target, params=params)
dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))
dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()
scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
Now measure the runtime after tuning:
import timeit
timing_number = 10
timing_repeat = 10
optimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
optimized = {"mean": np.mean(optimized), "median": np.median(optimized), "std": np.std(optimized)}
print("optimized: %s" % (optimized))
print("unoptimized: %s" % (unoptimized))
This produces the following:
optimized: {'mean': 24.068975364789367, 'median': 23.36287514772266, 'std': 1.595523633701862}
unoptimized: {'mean': 22.72307151928544, 'median': 22.025499097071588, 'std': 1.3807440805647897}
The tuned model is actually slower than before. My guess is that the number of tuning trials was too small; the search may have ended before it even found a configuration as good as the pre-tuning one. Let's increase the tuning budget and try again:
# a small change to the tuning options:
tuning_option = {
    "tuner": "xgb",
    "trials": 1000,
    "early_stopping": 1000,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="default"), runner=runner
    ),
    "tuning_records": "resnet-50-v2-autotuning.json",
}
Sure enough, it got a bit faster, but not by much...
optimized: {'mean': 17.606267603114247, 'median': 17.284288653172553, 'std': 0.6380016394113105}
unoptimized: {'mean': 23.359683020971715, 'median': 22.71405295468867, 'std': 1.538688442981107}
Now for the key point. Tuning is closely tied to the code object being compiled: if you can name the exact CPU to compile for, the results are genuinely good, presumably because LLVM has many passes that depend on it. So below I change the target to my actual machine.
First, check what your target machine is:
llc-12 --version
This prints something like the following:
LLVM (http://llvm.org/):
LLVM version 12.0.0
Optimized build.
Default target: x86_64-pc-linux-gnu
Host CPU: znver2
Registered Targets:
aarch64 - AArch64 (little endian)
aarch64_32 - AArch64 (little endian ILP32)
...
The key line is Host CPU: znver2. Now change the target:
target = "llvm -mcpu=znver2"
After re-optimizing and recompiling, it is indeed much faster:
optimized: {'mean': 9.935215310179046, 'median': 9.931096900618286, 'std': 0.04295490942983339}
That's a qualitative leap, folks.
Optimizing ResNet50 with TVMC [1][2] ↩︎