A First Look at TVM: Compiling and Optimizing ResNet-50 with TVM's Python API

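  • Compiling and optimizing ResNet-50 with TVM's Python API
    • Download and load the ONNX model
    • Download and preprocess the image
    • Compile the model with Relay
    • Run the compiled model on the TVM runtime
    • Baseline performance data
    • Post-process the output
    • Tune the model with AutoTVM
    • Specify the target platform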

Compiling and optimizing ResNet-50 with TVM's Python API

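In the previous tutorial [1], we used tvmc, TVM's command-line tool, to compile and tune a pre-trained vision model, ResNet-50 v2. TVM also has a Python API, which offers a great deal of flexibility when optimizing deep learning models.
This tutorial covers the same ground as the tvmc tutorial, but does the work through the Python API instead of the TVMC tool. We will use TVM's Python API to: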

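  • Compile a pre-trained ResNet-50 v2 model for the TVM runtime
  • Run a real image through the compiled model and get a prediction
  • Tune the model on a CPU
  • Recompile the model using the tuning data
  • Run the tuned model on the real image and get its output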

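The goal of this tutorial is to give a general picture of how TVM is used, and to show how to compile and optimize a model with TVM's Python API.

A script that uses the Python API first needs to import a number of libraries, such as onnx and numpy.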

import onnx
from tvm.contrib.download import download_testdata
from PIL import Image
import numpy as np
import tvm.relay as relay
import tvm
from tvm.contrib import graph_executor

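Download and load the ONNX model

In this tutorial we again use ResNet-50 v2, a 50-layer convolutional network for image classification. The model was pre-trained on more than one million images across 1000 classes, and expects input images with a resolution of 224x224. ResNet-50 was introduced in the earlier tutorial, so we will not go over it again here.
TVM ships a helper for downloading pre-trained models: given the model URL, the file name, and the module type, it downloads the model and saves it locally.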

model_url = "".join(
    [
        "https://github.com/onnx/models/raw/",
        "master/vision/classification/resnet/model/",
        "resnet50-v2-7.onnx",
    ]
)

model_path = download_testdata(model_url, "resnet50-v2-7.onnx", module="onnx")
onnx_model = onnx.load(model_path)

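Download and preprocess the image

As in the previous tutorial, we will bring out this cute kitten again for classification (writing this makes me miss our Timmy).
[Image: the kitten photo used as input]
The download and preprocessing code is below: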

img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
img_path = download_testdata(img_url, "imagenet_cat.png", module="data")

# Resize it to 224x224
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")

# Our input image is in HWC layout while ONNX expects CHW input, so convert the array
img_data = np.transpose(img_data, (2, 0, 1))

# Normalize according to the ImageNet input specification
imagenet_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
imagenet_stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
norm_img_data = (img_data / 255 - imagenet_mean) / imagenet_stddev

# Add the batch dimension, as we are expecting 4-dimensional input: NCHW.
img_data = np.expand_dims(norm_img_data, axis=0)
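The preprocessing steps above can be collected into one function, which makes them easy to check without downloading anything. The following is a minimal sketch: the function name `preprocess` and the random test image are assumptions for illustration, not part of the original tutorial. It also casts the result to float32 explicitly, because dividing by the float64 mean and standard deviation arrays promotes the data to float64.

```python
import numpy as np


def preprocess(hwc_image):
    """Convert a 224x224 HWC RGB image to a normalized NCHW float32 batch."""
    data = np.asarray(hwc_image).astype("float32")
    data = np.transpose(data, (2, 0, 1))  # HWC -> CHW
    mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
    stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
    data = (data / 255 - mean) / stddev  # ImageNet normalization
    # Add the batch dimension and cast back to float32.
    return np.expand_dims(data, axis=0).astype("float32")


# Hypothetical stand-in for the resized kitten image.
fake_image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
batch = preprocess(fake_image)
print(batch.shape, batch.dtype)
```

With the real image, `preprocess(resized_image)` should give the same array as the step-by-step code, apart from the explicit float32 cast.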

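Compile the model with Relay

Next we can compile the ResNet-50 model. We first import it with Relay's ONNX frontend, then build it with the standard optimization pipeline, and finally create a TVM graph runtime module: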

target = "llvm"
# The input name may vary across model types. You can use a tool
# like Netron to check input names
input_name = "data"
shape_dict = {input_name: img_data.shape}

mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))

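Giving the exact platform in the target yields better performance, because TVM can exploit platform-specific features when choosing its optimizations. For example, use target = "llvm -mcpu=skylake", or target = "llvm -mcpu=skylake-avx512" to take advantage of the AVX-512 instruction set on x86.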

Run the compiled model on the TVM runtime

With the model compiled, we can run inference through the TVM runtime.

dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

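I wonder whether float16 would also work here.

Baseline performance data

Before tuning the model, let's collect performance numbers for the baseline build. To make the measurement reliable, we run the model several times and compute the average run time.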

import timeit

timing_number = 10
timing_repeat = 10
unoptimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
unoptimized = {
    "mean": np.mean(unoptimized),
    "median": np.median(unoptimized),
    "std": np.std(unoptimized),
}

print(unoptimized)

On my machine, the timings (in milliseconds) were:

{'mean': 22.72307151928544, 'median': 22.025499097071588, 'std': 1.3807440805647897}
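Since the same measurement is repeated later for the tuned model, it can help to wrap it in a small function. This is a sketch using only the standard library; the name `benchmark` and the dummy workload are assumptions for illustration. `statistics.pstdev` is the population standard deviation, which is what `np.std` computes by default.

```python
import statistics
import timeit


def benchmark(fn, number=10, repeat=10):
    """Return mean/median/std of the per-call latency of fn, in milliseconds."""
    # Each entry of totals is the time in seconds for `number` calls.
    totals = timeit.Timer(fn).repeat(repeat=repeat, number=number)
    per_call_ms = [t * 1000 / number for t in totals]
    return {
        "mean": statistics.mean(per_call_ms),
        "median": statistics.median(per_call_ms),
        "std": statistics.pstdev(per_call_ms),
    }


# Hypothetical workload standing in for module.run().
stats = benchmark(lambda: sum(range(10000)))
print(stats)
```

In this tutorial the call would be `benchmark(lambda: module.run())`, once before and once after tuning.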

Post-process the output

As before, we post-process the model output using the label list from the model zoo:

from scipy.special import softmax

# Download a list of labels
labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
labels_path = download_testdata(labels_url, "synset.txt", module="data")

with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]

# Open the output and read the output tensor
scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))

This gives the following result:

class='n02123045 tabby, tabby cat' with probability=0.610552
class='n02123159 tiger cat' with probability=0.367179
class='n02124075 Egyptian cat' with probability=0.019365
class='n02129604 tiger, Panthera tigris' with probability=0.001273
class='n04040759 radiator' with probability=0.000261
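For reference, the softmax-and-rank step can be written with NumPy alone. This is a sketch of the standard numerically stable softmax, not SciPy's implementation; the four-element `logits` vector and the label names are made up for illustration.

```python
import numpy as np


def softmax_np(x):
    """Numerically stable softmax over the last axis."""
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)


# Hypothetical logits with shape (1, 4) instead of the real (1, 1000) output.
logits = np.array([[2.0, 1.0, 0.1, -1.0]], dtype="float32")
toy_labels = ["cat", "dog", "fox", "car"]

toy_scores = np.squeeze(softmax_np(logits))
toy_ranks = np.argsort(toy_scores)[::-1]  # indices sorted by descending score
for rank in toy_ranks[:2]:
    print("class='%s' with probability=%f" % (toy_labels[rank], toy_scores[rank]))
```

The scores sum to 1, so they can be read as class probabilities, and `argsort` reversed gives the classes from most to least likely.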

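Tune the model with AutoTVM

As in the previous tutorial, we can tune the model with the AutoTVM module. The difference is that this time we drive the tuning through the Python API.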

import tvm.auto_scheduler as auto_scheduler
from tvm.autotvm.tuner import XGBTuner
from tvm import autotvm

number = 10
repeat = 1
min_repeat_ms = 0  # since we're tuning on a CPU, can be set to 0
timeout = 10  # in seconds

# create a TVM runner
runner = autotvm.LocalRunner(
    number=number,
    repeat=repeat,
    timeout=timeout,
    min_repeat_ms=min_repeat_ms,
    enable_cpu_cache_flush=True,
)

tuning_option = {
    "tuner": "xgb",
    "trials": 10,
    "early_stopping": 100,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="default"), runner=runner
    ),
    "tuning_records": "resnet-50-v2-autotuning.json",
}

# begin by extracting the tasks from the onnx model
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

# Tune the extracted tasks sequentially.
for i, task in enumerate(tasks):
    prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
    tuner_obj = XGBTuner(task, loss_type="rank")
    tuner_obj.tune(
        n_trial=min(tuning_option["trials"], len(task.config_space)),
        early_stopping=tuning_option["early_stopping"],
        measure_option=tuning_option["measure_option"],
        callbacks=[
            autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
            autotvm.callback.log_to_file(tuning_option["tuning_records"]),
        ],
    )

During tuning we can watch the results task by task:

[Task  1/25]  Current/Best:  165.26/ 216.09 GFLOPS | Progress: (10/10) | 4.50 s Done.
[Task  2/25]  Current/Best:  150.98/ 192.87 GFLOPS | Progress: (10/10) | 3.50 s Done.
[Task  3/25]  Current/Best:  168.08/ 249.73 GFLOPS | Progress: (10/10) | 5.44 s Done.
[Task  4/25]  Current/Best:   95.23/ 196.60 GFLOPS | Progress: (10/10) | 7.79 s Done.
[Task  5/25]  Current/Best:  207.23/ 262.81 GFLOPS | Progress: (10/10) | 4.27 s Done.
[Task  6/25]  Current/Best:  132.89/ 550.37 GFLOPS | Progress: (10/10) | 7.08 s Done.
[Task  7/25]  Current/Best:  261.82/ 284.37 GFLOPS | Progress: (10/10) | 3.83 s Done.
[Task  8/25]  Current/Best:  257.41/ 433.27 GFLOPS | Progress: (10/10) | 3.97 s Done.
[Task  9/25]  Current/Best:  176.71/ 211.27 GFLOPS | Progress: (10/10) | 10.51 s Done.
[Task 10/25]  Current/Best:  128.45/ 311.06 GFLOPS | Progress: (10/10) | 3.42 s Done.
[Task 11/25]  Current/Best:  211.18/ 284.83 GFLOPS | Progress: (10/10) | 3.99 s Done.
[Task 12/25]  Current/Best:  165.26/ 325.64 GFLOPS | Progress: (10/10) | 9.99 s Done.
[Task 13/25]  Current/Best:  261.55/ 328.09 GFLOPS | Progress: (10/10) | 5.42 s Done.
[Task 14/25]  Current/Best:  242.21/ 289.98 GFLOPS | Progress: (10/10) | 9.33 s Done.
[Task 15/25]  Current/Best:  231.47/ 241.25 GFLOPS | Progress: (10/10) | 9.91 s Done.
[Task 16/25]  Current/Best:  271.65/ 271.65 GFLOPS | Progress: (10/10) | 3.84 s Done.
[Task 17/25]  Current/Best:  245.57/ 245.57 GFLOPS | Progress: (10/10) | 4.32 s Done.
[Task 18/25]  Current/Best:  292.00/ 381.25 GFLOPS | Progress: (10/10) | 4.28 s Done.
[Task 19/25]  Current/Best:   79.44/ 441.70 GFLOPS | Progress: (10/10) | 4.42 s Done.
[Task 20/25]  Current/Best:  516.32/ 541.19 GFLOPS | Progress: (10/10) | 13.30 s Done.
[Task 21/25]  Current/Best:  414.80/ 449.63 GFLOPS | Progress: (10/10) | 4.99 s Done.
[Task 22/25]  Current/Best:   16.18/ 490.37 GFLOPS | Progress: (10/10) | 5.87 s Done.
[Task 23/25]  Current/Best:  443.06/ 573.66 GFLOPS | Progress: (10/10) | 4.92 s Done.
[Task 24/25]  Current/Best:    5.43/  80.01 GFLOPS | Progress: (10/10) | 13.03 s Done.
[Task 25/25]  Current/Best:   19.45/  24.78 GFLOPS | Progress: (10/10) | 13.21 s Done.
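To compare tasks at a glance, the progress lines can be parsed with a regular expression. The sketch below works on three lines copied from the log above; the variable names are assumptions, and a low best throughput is only a rough hint about where time goes, since some operators are inherently less compute-bound than others.

```python
import re

# Three lines copied from the tuning log above.
log_text = """\
[Task  6/25]  Current/Best:  132.89/ 550.37 GFLOPS | Progress: (10/10) | 7.08 s Done.
[Task 24/25]  Current/Best:    5.43/  80.01 GFLOPS | Progress: (10/10) | 13.03 s Done.
[Task 25/25]  Current/Best:   19.45/  24.78 GFLOPS | Progress: (10/10) | 13.21 s Done.
"""

# Capture the task index and the "Best" throughput from each line.
pattern = re.compile(
    r"\[Task\s+(\d+)/\s*\d+\]\s+Current/Best:\s*[\d.]+/\s*([\d.]+) GFLOPS"
)
best = {int(m.group(1)): float(m.group(2)) for m in pattern.finditer(log_text)}

slowest = min(best, key=best.get)
print("lowest best throughput: task %d at %.2f GFLOPS" % (slowest, best[slowest]))
```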

Once tuning has produced its records, we recompile the model using the tuning log, then run it again and measure the time:

with autotvm.apply_history_best(tuning_option["tuning_records"]):
    with tvm.transform.PassContext(opt_level=3, config={}):
        lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))

dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))

Measure the run time after tuning:

import timeit

timing_number = 10
timing_repeat = 10
optimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
optimized = {"mean": np.mean(optimized), "median": np.median(optimized), "std": np.std(optimized)}


print("optimized: %s" % (optimized))
print("unoptimized: %s" % (unoptimized))

This gives the following result:

optimized: {'mean': 24.068975364789367, 'median': 23.36287514772266, 'std': 1.595523633701862}
unoptimized: {'mean': 22.72307151928544, 'median': 22.025499097071588, 'std': 1.3807440805647897}

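After tuning, the model is actually slower than before. My guess is that there were too few tuning trials, so the search ended before it even found configurations as good as the untuned defaults. Let's increase the number of trials and try again: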

#small change of tuning option:
tuning_option = {
    "tuner": "xgb",
    "trials": 1000,
    "early_stopping": 1000,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="default"), runner=runner
    ),
    "tuning_records": "resnet-50-v2-autotuning.json",
}

Sure enough, it got somewhat faster, though not by much...

optimized: {'mean': 17.606267603114247, 'median': 17.284288653172553, 'std': 0.6380016394113105}
unoptimized: {'mean': 23.359683020971715, 'median': 22.71405295468867, 'std': 1.538688442981107}
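To put a number on the improvement, the two means reported above can be compared directly. This is plain arithmetic on the figures from this run; the variable names are assumptions.

```python
# Mean latencies in milliseconds, copied from the output above.
unoptimized_mean_ms = 23.359683020971715
optimized_mean_ms = 17.606267603114247

speedup = unoptimized_mean_ms / optimized_mean_ms
reduction = 1 - optimized_mean_ms / unoptimized_mean_ms
print("speedup: %.2fx, latency reduction: %.1f%%" % (speedup, reduction * 100))
```

That works out to a speedup of roughly 1.3x, or about a quarter less time per inference.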

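Specify the target platform

Now for the key point.
Tuning results depend heavily on the compiled code object. Compiling for a specific CPU works really well; presumably LLVM has many passes that depend on the CPU model. Below I change the target to match my machine.
First, check which CPU the target machine has: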

llc-12 --version

It prints something like this:

LLVM (http://llvm.org/):
  LLVM version 12.0.0
  
  Optimized build.
  Default target: x86_64-pc-linux-gnu
  Host CPU: znver2

  Registered Targets:
    aarch64    - AArch64 (little endian)
    aarch64_32 - AArch64 (little endian ILP32)
    ...

The key line is Host CPU: znver2. Now change the target:

target = "llvm -mcpu=znver2"
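The target string can also be derived from the llc output instead of being typed by hand. The sketch below parses an excerpt of the output shown above; the function name `target_from_llc` and the fallback to plain "llvm" are assumptions for illustration.

```python
import re

# Excerpt of the `llc-12 --version` output shown above.
llc_output = """\
LLVM (http://llvm.org/):
  LLVM version 12.0.0

  Optimized build.
  Default target: x86_64-pc-linux-gnu
  Host CPU: znver2
"""


def target_from_llc(text):
    """Build an "llvm -mcpu=..." target string from llc's "Host CPU" line."""
    match = re.search(r"^\s*Host CPU:\s*(\S+)", text, flags=re.MULTILINE)
    if match is None:
        return "llvm"  # no CPU name found: fall back to the generic target
    return "llvm -mcpu=%s" % match.group(1)


detected_target = target_from_llc(llc_output)
print(detected_target)
```

On a real machine the text could come from running llc through `subprocess`; the binary name (`llc-12` here) depends on the installed LLVM version.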

After re-running tuning and compilation, it is indeed much faster:

optimized: {'mean': 9.935215310179046, 'median': 9.931096900618286, 'std': 0.04295490942983339}

A qualitative leap, folks.


  1. Optimizing ResNet-50 with TVMC [1] [2]
