Porting TensorFlow Lite to the i.MX6 ARM board

As described in the previous post, after the port to the LC1860C board failed, I switched to a board with a newer and more complete set of libraries and carried on with the grand undertaking.

Running label_image

 ./label_image -v 1 -m ./mobilenet_v1_1.0_224.tflite -i ./grace_hopper.jpg -l ./imagenet_slim_labels.txt

alloc failure

The first problem I hit was an allocation failure.

...
83: MobilenetV1/MobilenetV1/Conv2d_9_depthwise/weights_quant/FakeQuantWithMinMaxVars, 1152, 3, 0.0212288, 120
84: MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Conv2D_Fold_bias, 512, 2, 0.000260965, 0
85: MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6, 8192, 3, 0.0235285, 0
86: MobilenetV1/MobilenetV1/Conv2d_9_pointwise/weights_quant/FakeQuantWithMinMaxVars, 16384, 3, 0.0110914, 146
87: MobilenetV1/Predictions/Reshape_1, 1001, 3, 0.00390625, 0
88: input, 49152, 3, 0.0078125, 128
len: 61306
width, height, channels: -16842752, 1766213120, 246279780
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

At first I didn't pay much attention to it. I assumed the board was simply too weak and that the label_image example needed more memory than it had, so I planned to write a smaller example of my own and see whether it crashed too. Later, while studying TensorFlow to prepare that example, I noticed that the width, height, channels values in the log looked wrong: why were they so huge, and why was one negative? So I went back and read the label_image code carefully and found that at this point Invoke() hadn't even been called yet, i.e. TensorFlow Lite hadn't actually started running; the program was dying inside read_bmp. Only then did it dawn on me that the image I had passed in was a JPEG rather than a BMP. I quickly switched to a BMP image and the alloc problem was gone.

  std::vector<uint8_t> in = read_bmp(s->input_bmp_name, &image_width,
                                     &image_height, &image_channels, s);
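
For intuition, here is the kind of sanity check that would have caught this right away. It is only a sketch of the standard BMP header layout (the 'BM' magic bytes, then width/height/bit-depth at fixed offsets), not the actual read_bmp implementation; feed it a JPEG and the header fields come out as exactly the sort of garbage seen in the log above, which is what then blows up the allocation.

    // Sketch: validate that a file really is a BMP before trusting its header.
    // Offsets follow the standard BITMAPFILEHEADER/BITMAPINFOHEADER layout.
    #include <cstdint>
    #include <cstring>
    #include <fstream>
    #include <string>
    #include <vector>

    bool looks_like_bmp(const std::string& path, int32_t* width, int32_t* height,
                        int16_t* bits_per_pixel) {
      std::ifstream file(path, std::ios::binary);
      std::vector<char> header(54);  // 14-byte file header + 40-byte info header
      if (!file.read(header.data(), header.size())) return false;
      if (header[0] != 'B' || header[1] != 'M') return false;  // magic bytes
      std::memcpy(width, header.data() + 18, 4);           // little-endian int32
      std::memcpy(height, header.data() + 22, 4);          // negative = top-down
      std::memcpy(bits_per_pixel, header.data() + 28, 2);  // 1/8/24/32 expected
      // Reject nonsense before anything tries to allocate width*height*channels.
      return *width > 0 && *height != 0 &&
             (*bits_per_pixel == 1 || *bits_per_pixel == 8 ||
              *bits_per_pixel == 24 || *bits_per_pixel == 32);
    }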

illegal instruction

The next problem was an illegal instruction.

...
Node  29 Operator Builtin Code  22
  Inputs: 1 5
  Outputs: 4
Node  30 Operator Builtin Code  25
  Inputs: 4
  Outputs: 87
Illegal instruction

I had never run into an Illegal instruction error before. At first I assumed it was a log message from TensorFlow, searched the code for it and found nothing, and only after looking it up online did I learn that it comes from Linux itself. It usually means the CPU on the board doesn't recognize one of the program's instructions, typically because the architecture selected at build time doesn't match (or isn't compatible with) the ARM architecture actually on the board. So once again, the build environment was to blame.
I also learned that an illegal instruction triggers a core dump, so the core file was the place to start.
Many thanks to: https://blog.csdn.net/chyxwzn/article/details/8879750?utm_source=tuicool

ulimit -c unlimited

I re-ran label_image to get a core file and opened it in gdb:

(gdb) bt
#0  0x0001f030 in tflite::optimized_ops::ResizeBilinear(tflite::ResizeBilinearParams const&, tflite::RuntimeShape const&, float const*, tflite::RuntimeShape const&, int const*, tflite::RuntimeShape const&, float*) ()
#1  0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p $pc
$1 = (void (*)()) 0x1f030 
(gdb) p $sp
$2 = (void *) 0x7e9e43c8
(gdb) x/5i $pc
=> 0x1f030 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7720>:
    vfma.f32    s14, s13, s15
   0x1f034 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7724>:
    vstmia      r2!, {s14}
   0x1f038 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7728>:
    bgt 0x1f01c <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7700>
   0x1f03c <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7732>:
    vorr        d30, d16, d16
   0x1f040 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7736>:
    vorr        d31, d17, d17

So the crash happens on the vfma.f32 s14, s13, s15 instruction. vfma is the fused multiply-accumulate added in VFPv4, so it looks like the board in my hands simply doesn't support this floating-point operation.

First, check the platform attributes of the compiled binary:

root@imx6dl-albatross2:~/march_build# readelf -A label_image
Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "7-A"
  Tag_CPU_arch: v7
  Tag_CPU_arch_profile: Application
  Tag_ARM_ISA_use: Yes
  Tag_THUMB_ISA_use: Thumb-2
  Tag_FP_arch: VFPv4
  Tag_Advanced_SIMD_arch: NEONv1 with Fused-MAC
  Tag_ABI_PCS_wchar_t: 4
  Tag_ABI_FP_rounding: Needed
  Tag_ABI_FP_denormal: Needed
  Tag_ABI_FP_exceptions: Needed
  Tag_ABI_FP_number_model: IEEE 754
  Tag_ABI_align_needed: 8-byte
  Tag_ABI_align_preserved: 8-byte, except leaf SP
  Tag_ABI_enum_size: int
  Tag_ABI_VFP_args: VFP registers
  Tag_CPU_unaligned_access: v6
...

So the binary was built for the VFPv4 instruction set (Tag_FP_arch: VFPv4, plus NEON with fused MAC).

Then check what the board itself supports:

root@imx6dl-albatross2:~# gcc -march=native -Q --help=target|grep march
  -march=                               armv7-a
  Known ARM architectures (for use with the -march= option):

root@imx6dl-albatross2:~# cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 10 (v7l)
BogoMIPS        : 3.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

processor       : 1
model name      : ARMv7 Processor rev 10 (v7l)
BogoMIPS        : 3.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

Hardware        : Freescale i.MX6 Quad/DualLite (Device Tree)
Revision        : 0000
Serial          : 0000000000000000

The board, however, only supports VFPv3 (the Features line lists vfpv3 but no vfpv4), and this mismatch is what makes the instruction unrecognized.
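As an aside, the same information can be checked from code instead of eyeballing /proc/cpuinfo, for example to decide at startup which binary or code path to use. A minimal sketch (Linux-on-ARM only; it just tokenizes the Features line shown above):

    // Sketch: report whether the CPU advertises a given FP/SIMD feature by
    // parsing the "Features" line of /proc/cpuinfo (Linux on ARM).
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    bool cpu_has_feature(const std::string& feature) {
      std::ifstream cpuinfo("/proc/cpuinfo");
      std::string line;
      while (std::getline(cpuinfo, line)) {
        if (line.compare(0, 8, "Features") != 0) continue;
        std::istringstream tokens(line.substr(line.find(':') + 1));
        std::string token;
        while (tokens >> token) {
          if (token == feature) return true;
        }
      }
      return false;
    }

    int main() {
      // On this i.MX6 the first two print 1 and the last prints 0.
      std::cout << "neon:  " << cpu_has_feature("neon") << "\n"
                << "vfpv3: " << cpu_has_feature("vfpv3") << "\n"
                << "vfpv4: " << cpu_has_feature("vfpv4") << "\n";
    }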
So the VFP flags passed to the compiler need to be changed.
At first I added -mfpu=vfpv3 to CXXFLAGS in tensorflow/contrib/lite/tools/make/Makefile, but the resulting binary was still VFPv4. Looking at the compile log shows why:

arm-poky-linux-gnueabi-g++  -march=armv7-a -mfloat-abi=hard -mfpu=neon -mtune=cortex-a9 --sysroot=/opt/fsl-imx-x11/4.1.15-1.2.0/sysroots/cortexa9hf-vfp-neon-poky-linux-gnueabi -O3 -DNDEBUG -mfpu=vfpv3 -march=armv4t --std=c++11 -march=armv7-a -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize -fPIC -I. -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/../../../../../ -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/../../../../../../ -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/ -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/eigen -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/absl -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/gemmlowp -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/neon_2_sse -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/farmhash/src -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/flatbuffers/include -I -I/usr/local/include -c tensorflow/contrib/lite/kernels/slice.cc -o /home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/gen/rpi_armv7l/obj/tensorflow/contrib/lite/kernels/slice.o

The command line still contains -mfpu=neon-vfpv4, so the flag is being set somewhere else, outside that Makefile (and since GCC lets the last -mfpu on the command line win, it was overriding my -mfpu=vfpv3). A global search for -mfpu in the tree turned up another definition in tensorflow/contrib/lite/tools/make/targets/rpi_makefile.inc, whose name makes it obvious that it gets pulled into the build, so I commented out both of its -mfpu=neon-vfpv4 \ lines.

    CXXFLAGS += \
      -march=armv7-a \
      -mfpu=neon-vfpv4 \
      -funsafe-math-optimizations \
      -ftree-vectorize \
      -fPIC
    CCFLAGS += \
      -march=armv7-a \
      -mfpu=neon-vfpv4 \
      -funsafe-math-optimizations \
      -ftree-vectorize \
      -fPIC

Rebuilding, VFPv4 indeed no longer appears in the compile log, and the resulting label_image runs fine on the board:

root@imx6dl-albatross2:~/vfpv3_build#  ./label_image -v 1 -m ./mobilenet_v1_0.25_128_quant.tflite  -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
...
Node  30 Operator Builtin Code  25
  Inputs: 4
  Outputs: 87
invoked
average time: 380.068 ms
0.164706: 401 academic gown
0.145098: 835 suit
0.0745098: 668 mortarboard
0.0745098: 458 bow tie
0.0509804: 653 military uniform

The result doesn't look great, though. In other people's runs, "military uniform" comes out with high probability, so maybe the mobilenet_v1_0.25_128_quant.tflite model I used is just too small. Let's try mobilenet_v1_1.0_224.tflite instead:

root@imx6dl-albatross2:~/vfpv3_build#  ./label_image -v 1 -m ./mobilenet_v1_1.0_224.tflite  -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
...
Node  30 Operator Builtin Code  25
  Inputs: 31
  Outputs: 86
invoked
average time: 2784.13 ms
0.860174: 653 military uniform
0.0481022: 907 Windsor tie
0.007867: 466 bulletproof vest
0.00644933: 514 cornet
0.00608031: 543 drumstick

Now the result is correct, but the run time is much longer.
For comparison, https://blog.csdn.net/computerme/article/details/80345065 reports only about 800 ms, so this board's performance probably just isn't up to it.

Switching to the quantized mobilenet_v1_1.0_224_quant.tflite:

root@imx6dl-albatross2:~/vfpv3_build#  ./label_image -v 4 -m ./mobilenet_v1_1.0_224_quant.tflite  -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
...
Node  30 Operator Builtin Code  25
  Inputs: 4
  Outputs: 87
invoked
average time: 2311.57 ms
0.780392: 653 military uniform
0.105882: 907 Windsor tie
0.0156863: 458 bow tie
0.0117647: 466 bulletproof vest
0.00784314: 835 suit

For the author of that link, quantization cut the run time significantly; for me it barely changed...

mobilenet_v2_1.0_224_quant.tflite doesn't seem to improve things either...

root@imx6dl-albatross2:~/vfpv3_build#  ./label_image -v 4 -m ./mobilenet_v2_1.0_224_quant.tflite  -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
...
Node  64 Operator Builtin Code  22
  Inputs: 7 10
  Outputs: 172
invoked
average time: 2073.31 ms
0.717647: 653 military uniform
0.560784: 835 suit
0.533333: 458 bow tie
0.52549: 907 Windsor tie
0.517647: 753 racket

I'll leave the timing problem for later. At least TensorFlow Lite is up and running, and I can move on to writing my own example; a minimal sketch of what that might look like follows.
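
As a starting point for that example, this is roughly what a bare-bones standalone program looks like with the C++ API in this TF 1.12 contrib/lite tree (a sketch for a float model such as mobilenet_v1_1.0_224.tflite; error handling and image decoding are trimmed, and in later TensorFlow releases the headers move from tensorflow/contrib/lite/ to tensorflow/lite/):

    // Sketch: minimal TensorFlow Lite inference with the C++ API, assuming a
    // float input/output model such as mobilenet_v1_1.0_224.tflite.
    #include <cstdio>
    #include <memory>

    #include "tensorflow/contrib/lite/interpreter.h"
    #include "tensorflow/contrib/lite/kernels/register.h"
    #include "tensorflow/contrib/lite/model.h"

    int main(int argc, char* argv[]) {
      const char* model_path = argc > 1 ? argv[1] : "mobilenet_v1_1.0_224.tflite";

      // Load the flatbuffer model and build an interpreter with the builtin ops.
      auto model = tflite::FlatBufferModel::BuildFromFile(model_path);
      if (!model) { fprintf(stderr, "failed to load %s\n", model_path); return 1; }
      tflite::ops::builtin::BuiltinOpResolver resolver;
      std::unique_ptr<tflite::Interpreter> interpreter;
      tflite::InterpreterBuilder(*model, resolver)(&interpreter);
      if (!interpreter || interpreter->AllocateTensors() != kTfLiteOk) {
        fprintf(stderr, "failed to build interpreter\n");
        return 1;
      }

      // Fill the input tensor (1x224x224x3 floats for this model) -- here just
      // zeros; a real example would copy in the decoded, resized image.
      float* input = interpreter->typed_input_tensor<float>(0);
      for (int i = 0; i < 224 * 224 * 3; ++i) input[i] = 0.0f;

      if (interpreter->Invoke() != kTfLiteOk) {
        fprintf(stderr, "Invoke failed\n");
        return 1;
      }

      // Print the first few class scores from the output tensor.
      float* output = interpreter->typed_output_tensor<float>(0);
      for (int i = 0; i < 5; ++i) printf("score[%d] = %f\n", i, output[i]);
      return 0;
    }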
A perfect end to the workday!
