In the previous post I described how the port to the LC1860C board failed, so I switched to a board with newer and more complete libraries and carried on with the project.
Running label_image
./label_image -v 1 -m ./mobilenet_v1_1.0_224.tflite -i ./grace_hopper.jpg -l ./imagenet_slim_labels.txt
alloc failure
The first problem I hit was an alloc failure.
...
83: MobilenetV1/MobilenetV1/Conv2d_9_depthwise/weights_quant/FakeQuantWithMinMaxVars, 1152, 3, 0.0212288, 120
84: MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Conv2D_Fold_bias, 512, 2, 0.000260965, 0
85: MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6, 8192, 3, 0.0235285, 0
86: MobilenetV1/MobilenetV1/Conv2d_9_pointwise/weights_quant/FakeQuantWithMinMaxVars, 16384, 3, 0.0110914, 146
87: MobilenetV1/Predictions/Reshape_1, 1001, 3, 0.00390625, 0
88: input, 49152, 3, 0.0078125, 128
len: 61306
width, height, channels: -16842752, 1766213120, 246279780
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
At first I didn't think much of it, assuming the board was simply too weak and the TensorFlow Lite label_image example used too much memory, which is why it crashed. I planned to write a simpler example of my own to see whether that would crash too. Later, while studying TensorFlow in preparation for writing it, I noticed that the width, height, channels values in the log looked wrong: why were they so huge, and even negative? So I went back and read the label_image code carefully, and found that execution hadn't even reached invoke yet, i.e. TensorFlow Lite hadn't started running at all; the program was dying inside read_bmp. Only then did I belatedly realize that the image I had supplied was a jpg, not a bmp. After quickly switching to a bmp, the alloc problem was gone.
std::vector<uint8_t> in = read_bmp(s->input_bmp_name, &image_width,
                                   &image_height, &image_channels, s);
illegal instruction
The next problem was an illegal instruction.
...
Node 29 Operator Builtin Code 22
Inputs: 1 5
Outputs: 4
Node 30 Operator Builtin Code 25
Inputs: 4
Outputs: 87
Illegal instruction
I had never run into an Illegal instruction before. At first I assumed it was a message logged by TensorFlow, searched the code for it, and found nothing; only after looking it up online did I learn it comes from Linux itself. It usually means the CPU on the board doesn't recognize one of the program's instructions, typically because the architecture selected at compile time doesn't match (or isn't compatible with) the ARM architecture actually on the board. So once again the environment was to blame.
I also learned online that an illegal instruction produces a core dump, so the core file seemed like the place to start.
Credit where it's due: https://blog.csdn.net/chyxwzn/article/details/8879750?utm_source=tuicool
ulimit -c unlimited
Re-run label_image to produce a core file, then load it into gdb.
(gdb) bt
#0 0x0001f030 in tflite::optimized_ops::ResizeBilinear(tflite::ResizeBilinearParams const&, tflite::RuntimeShape const&, float const*, tflite::RuntimeShape const&, int const*, tflite::RuntimeShape const&, float*) ()
#1 0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p $pc
$1 = (void (*)()) 0x1f030
(gdb) p $sp
$2 = (void *) 0x7e9e43c8
(gdb) x/5i $pc
=> 0x1f030 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7720>:
vfma.f32 s14, s13, s15
0x1f034 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7724>:
vstmia r2!, {s14}
0x1f038 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7728>:
bgt 0x1f01c <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7700>
0x1f03c <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7732>:
vorr d30, d16, d16
0x1f040 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7736>:
vorr d31, d17, d17
So the crash is on the vfma.f32 s14, s13, s15 instruction. It looks like the board I have doesn't support this fused multiply-accumulate floating-point operation.
First, check the platform attributes of the compiled binary:
root@imx6dl-albatross2:~/march_build# readelf -A label_image
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "7-A"
Tag_CPU_arch: v7
Tag_CPU_arch_profile: Application
Tag_ARM_ISA_use: Yes
Tag_THUMB_ISA_use: Thumb-2
Tag_FP_arch: VFPv4
Tag_Advanced_SIMD_arch: NEONv1 with Fused-MAC
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_rounding: Needed
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_align_preserved: 8-byte, except leaf SP
Tag_ABI_enum_size: int
Tag_ABI_VFP_args: VFP registers
Tag_CPU_unaligned_access: v6
...
This shows the binary was built for the VFPv4 instruction set.
Then check the board itself:
root@imx6dl-albatross2:~# gcc -march=native -Q --help=target|grep march
-march= armv7-a
Known ARM architectures (for use with the -march= option):
root@imx6dl-albatross2:~# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 10 (v7l)
BogoMIPS : 3.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 10
processor : 1
model name : ARMv7 Processor rev 10 (v7l)
BogoMIPS : 3.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 10
Hardware : Freescale i.MX6 Quad/DualLite (Device Tree)
Revision : 0000
Serial : 0000000000000000
The board, however, only supports VFPv3 (its Features line lists vfpv3, not vfpv4), and this mismatch must be why the instruction isn't recognized. So the FPU compile flags need to be rewritten.
At first I added -mfpu=vfpv3 to the CXXFLAGS in tensorflow/contrib/lite/tools/make/Makefile, but the output was still VFPv4. Watching the build log shows why:
arm-poky-linux-gnueabi-g++ -march=armv7-a -mfloat-abi=hard -mfpu=neon -mtune=cortex-a9 --sysroot=/opt/fsl-imx-x11/4.1.15-1.2.0/sysroots/cortexa9hf-vfp-neon-poky-linux-gnueabi -O3 -DNDEBUG -mfpu=vfpv3 -march=armv4t --std=c++11 -march=armv7-a -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize -fPIC -I. -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/../../../../../ -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/../../../../../../ -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/ -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/eigen -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/absl -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/gemmlowp -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/neon_2_sse -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/farmhash/src -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/flatbuffers/include -I -I/usr/local/include -c tensorflow/contrib/lite/kernels/slice.cc -o /home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/gen/rpi_armv7l/obj/tensorflow/contrib/lite/kernels/slice.o
The command line still contains -mfpu=neon-vfpv4, so the flag is being set somewhere else, outside the Makefile. A global search of the tree for -mfpu turned up another definition in tensorflow/contrib/lite/tools/make/targets/rpi_makefile.inc (a file whose name makes clear it gets pulled in), so I commented out the -mfpu=neon-vfpv4 \ lines there.
CXXFLAGS += \
-march=armv7-a \
-mfpu=neon-vfpv4 \
-funsafe-math-optimizations \
-ftree-vectorize \
-fPIC
CCFLAGS += \
-march=armv7-a \
-mfpu=neon-vfpv4 \
-funsafe-math-optimizations \
-ftree-vectorize \
-fPIC
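For reference, after the edit the fragment in rpi_makefile.inc looks roughly like this, assuming the offending lines are removed outright (note that in GNU Make a # comment swallows a trailing backslash continuation, so commenting a line inside the continued assignment would also eat the line after it; deleting is safer):

```makefile
# tensorflow/contrib/lite/tools/make/targets/rpi_makefile.inc,
# with the -mfpu=neon-vfpv4 lines removed so the -mfpu=vfpv3 added
# in the top-level Makefile takes effect (GCC honors the last -mfpu).
CXXFLAGS += \
  -march=armv7-a \
  -funsafe-math-optimizations \
  -ftree-vectorize \
  -fPIC
CCFLAGS += \
  -march=armv7-a \
  -funsafe-math-optimizations \
  -ftree-vectorize \
  -fPIC
```

An alternative worth noting: the Cortex-A9 on this board does support -mfpu=neon (NEON with VFPv3), which would keep NEON vectorization; ending up with plain vfpv3 turns NEON off, which may partly explain the long runtimes measured below.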
Sure enough, VFPv4 no longer appeared in the rebuild log, and the rebuilt label_image ran fine on the board:
root@imx6dl-albatross2:~/vfpv3_build# ./label_image -v 1 -m ./mobilenet_v1_0.25_128_quant.tflite -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
...
Node 30 Operator Builtin Code 25
Inputs: 4
Outputs: 87
invoked
average time: 380.068 ms
0.164706: 401 academic gown
0.145098: 835 suit
0.0745098: 668 mortarboard
0.0745098: 458 bow tie
0.0509804: 653 military uniform
The result doesn't look great, though: everyone else gets military uniform as the top result with high probability. Maybe the mobilenet_v1_0.25_128_quant.tflite model I used is too small; let's try mobilenet_v1_1.0_224.tflite:
root@imx6dl-albatross2:~/vfpv3_build# ./label_image -v 1 -m ./mobilenet_v1_1.0_224.tflite -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
...
Node 30 Operator Builtin Code 25
Inputs: 31
Outputs: 86
invoked
average time: 2784.13 ms
0.860174: 653 military uniform
0.0481022: 907 Windsor tie
0.007867: 466 bulletproof vest
0.00644933: 514 cornet
0.00608031: 543 drumstick
And now the result is correct, although the runtime got much longer. For comparison, https://blog.csdn.net/computerme/article/details/80345065 reports only about 800 ms, so this board's performance may simply not be enough.
Switching to the quantized mobilenet_v1_1.0_224_quant.tflite:
root@imx6dl-albatross2:~/vfpv3_build# ./label_image -v 4 -m ./mobilenet_v1_1.0_224_quant.tflite -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
...
Node 30 Operator Builtin Code 25
Inputs: 4
Outputs: 87
invoked
average time: 2311.57 ms
0.780392: 653 military uniform
0.105882: 907 Windsor tie
0.0156863: 458 bow tie
0.0117647: 466 bulletproof vest
0.00784314: 835 suit
For the author in that link, quantization cut the runtime significantly; mine barely changed...
Switching to mobilenet_v2_1.0_224_quant.tflite doesn't seem to help either:
root@imx6dl-albatross2:~/vfpv3_build# ./label_image -v 4 -m ./mobilenet_v2_1.0_224_quant.tflite -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
...
Node 64 Operator Builtin Code 22
Inputs: 7 10
Outputs: 172
invoked
average time: 2073.31 ms
0.717647: 653 military uniform
0.560784: 835 suit
0.533333: 458 bow tie
0.52549: 907 Windsor tie
0.517647: 753 racket
The timing problem can wait until later. At least TensorFlow Lite is up and running, and I can move on to writing my own examples.
A perfect end to the workday!