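Previous articles in this series:
ZCU106 XRT environment setup
ZCU106 XRT Vivado project analysis
ZCU106 XRT PetaLinux project analysis
[XRT Vitis-Tutorials] RTL Kernels test
[XRT Vitis-Tutorials] Mixed C++/RTL kernel programming test
[XRT Vitis-Tutorials] Parallel image computation
[XRT Vitis-Tutorials] OpenCL scheduling optimization
Official documentation: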
2019.2 Vitis™ Application Acceleration Development Flow Tutorials
Vitis Unified Software Platform Documentation Application Acceleration Development
Vitis Unified Software Platform Documentation Embedded Software Development
Vitis ZCU106 Platform
ZCU106 Vitis Platform
Pre-built image; download it and copy it to an SD card to test directly:
ZCU106 Test Image
Code that uses the VCU:
zcu106_codec
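This article tests the fourth example in the tutorials: the Convolution Example.
The example processes video and runs several experiments, from CPU-only execution through to RTL-accelerated execution.
This experiment tests video image processing using the CPU alone.
Create a new Application Project in Vitis.
Platform: zcu106vcu_base, named conv_system
APP: conv_cpu
Add the sources to be compiled directly to the src directory, including:
everything under design/cpu_src
The final project directory structure is shown in the figure below:
Because this experiment is CPU only, there is no RTL part to accelerate, so cross-compiling with Vitis is all that is needed. To make the program run faster, set the compiler optimization option to -O3.
Copy the firmware to the SD card, then run the following commands to test.
A53 CPU test run: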
root@zcu106vcu_base:~# cp /mnt/conv_cpu.exe ./conv_cpu.exe
root@zcu106vcu_base:~# cp /mnt/video.mp4 ./video.mp4
root@zcu106vcu_base:~# ./conv_cpu.exe --gray true ./video.mp4 -o ./video_out.mp4
input: ./video.mp4
output: ./video_out.mp4
video size: 1920x1080
nframes: 132
IN COMMAND: ffmpeg -v error -hide_banner -i ./video.mp4 -f image2pipe -vcodec rawvideo -vf scale=w=1920:h=1080 -vframes 132 -
OUT COMMAND: ffmpeg -v error -hide_banner -y -f rawvideo -vcodec rawvideo -pix_fmt gray -s 1920x1080 -framerate 25 -i - -f mp4
Processing 132 frames of ./video.mp4 ...
[###################################] 100 %
Processed 7.91 MB in 49.259s (21.20 MBps)
root@zcu106vcu_base:~#
I have not yet found a way to run gprof on the A53 to profile performance, so the profiling was done on a PC instead.
i5 CPU test run:
convolution-tutorial/design/cpu_src$ ./convolve --gray true ./video.mp4 -o ./video_out.mp4
input: ./video.mp4
output: ./video_out.mp4
video size: 1920x1080
nframes: 132
IN COMMAND: ffmpeg -v error -hide_banner -i ./video.mp4 -f image2pipe -vcodec rawvideo -vf scale=w=1920:h=1080 -vframes 132 -pix_fmt rgba -
OUT COMMAND: ffmpeg -v error -hide_banner -y -f rawvideo -vcodec rawvideo -pix_fmt gray -s 1920x1080 -framerate 25 -i - -f mp4 -q:v 5 -an -codec mpeg4 ./video_out.mp4
Processing 132 frames of ./video.mp4 ...
[###################################] 100 %
Processed 7.91 MB in 8.927s (116.97 MBps)
convolution-tutorial/design/cpu_src$ gprof convolve gmon.out> gprofresult.txt
convolution-tutorial/design/cpu_src$ cat gprofresult.txt
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
93.43 7.53 7.53 132 57.05 57.05 convolve_cpu
6.71 8.07 0.54 132 4.10 4.10 grayscale_cpu
0.00 8.07 0.00 132 0.00 0.00 print_progress(int, int)
0.00 8.07 0.00 1 0.00 0.00 _GLOBAL__sub_I_default_output
***
convolution-tutorial/design/cpu_src$ sudo apt-get install graphviz
convolution-tutorial/design/cpu_src$ pip3 install gprof2dot
convolution-tutorial/design/cpu_src$ gprof2dot gprofresult.txt > gprof_graph.dot
convolution-tutorial/design/cpu_src$ dot -Tpng gprof_graph.dot -o gprof_graph.png
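Visualizing the program's call flow with graphviz's dot tool shows that the convolve_cpu function consumes most of the time (93.43% in the flat profile above).
On the A53, processing 132 frames at 1920×1080 with convolution and grayscale conversion took 49 seconds in total, and that figure also includes the time spent on decoding and encoding. Compared with the PC run (8.9 seconds), this is far too slow.
A53 CPU processing speed: 132/49 = 2.69 FPS, far short of the 30 FPS needed for real-time video processing.
To reach 30 FPS, hardware acceleration alongside the A53 has to deliver a speedup of 30/2.69 ≈ 11.2x, or 12x in round numbers.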
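The frame-rate and speedup arithmetic above can be reproduced with a small helper. This is only a sketch for checking the numbers; the function names `frames_per_second` and `required_speedup` are made up for illustration and do not come from the tutorial sources.

```cpp
#include <cassert>

// Frames per second when `frames` frames take `seconds` of wall-clock time.
// (Hypothetical helper, not part of the tutorial code.)
double frames_per_second(int frames, double seconds) {
    assert(frames >= 0 && seconds > 0.0);
    return frames / seconds;
}

// Speedup factor needed to move from `measured_fps` to `target_fps`.
double required_speedup(double measured_fps, double target_fps) {
    assert(measured_fps > 0.0 && target_fps > 0.0);
    return target_fps / measured_fps;
}
```

Calling `frames_per_second(132, 49.0)` and passing the result to `required_speedup(..., 30.0)` gives the figures quoted in the text. The same helpers can be reused for the later experiments by substituting their frame counts and run times.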
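Create a new Application Project in Vitis.
Platform: conv_system
APP: conv_kernel
Add the sources to be compiled directly to the src directory, including:
everything under design/multi_cu
The final project directory structure is shown in the figure below:
Note: the parts in bold italics are what differs from the previous test.
Following the example's settings, set the number of CUs for the Conv module to 4 to increase parallel processing speed.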
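To make the idea of several compute units concrete: the logs in the later experiments print `lines_per_compute_unit = 1080` for a single CU, which suggests each CU is given a band of image rows. The sketch below shows one way such a split could be computed. It is an assumption about the partitioning, not the tutorial's actual host code, and `RowRange` and `split_rows` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// A contiguous band of image rows assigned to one compute unit (hypothetical type).
struct RowRange {
    int first_row;
    int row_count;
};

// Split `height` rows into bands of ceil(height / compute_units) rows each.
// The last band may be shorter, and fewer bands than compute units are
// returned when height < compute_units. Halo rows needed by a convolution
// window at band borders are NOT handled here.
std::vector<RowRange> split_rows(int height, int compute_units) {
    assert(height > 0 && compute_units > 0);
    const int lines_per_cu = (height + compute_units - 1) / compute_units;
    std::vector<RowRange> ranges;
    for (int first = 0; first < height; first += lines_per_cu) {
        const int remaining = height - first;
        const int count = remaining < lines_per_cu ? remaining : lines_per_cu;
        ranges.push_back({first, count});
    }
    return ranges;
}
```

For a 1080-row frame, four CUs would each receive 270 rows under this scheme, while a single CU receives all 1080 rows.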
Copy the firmware to the SD card, then run the following command to test.
A53 CPU test run:
root@zcu106vcu_base:~# ./conv_kernel.exe ./conv_kernel.xclbin --gray true --kernel_name convolve_fpga ./video.mp4 -o ./video_out.mp4
input: ./video.mp4
output: ./video_out.mp4
video size: 1920x1080
nframes: 132
IN COMMAND: ffmpeg -v error -hide_banner -i ./video.mp4 -f image2pipe -vcodec rawvideo -vf scale=w=1920:h=1080 -vframes 132 -
OUT COMMAND: ffmpeg -v error -hide_banner -y -f rawvideo -vcodec rawvideo -pix_fmt gray -s 1920x1080 -framerate 25 -i - -f mp4
[ 3829.312242] [drm] Pid 12273 opened device
Binary Path: ./conv_kernel.xclbin
Processing 132 frames of ./vi[ 3829.318928] [drm] Pid 12273 closed device
deo.mp4 ...
[ 3829.328709] [drm] Pid 12273 opened device
platform Name: Xilinx
Vendor Name : Xilinx
Found Platform
INFO: Importing ./conv_kernel.xclbin
Loading: './conv_kernel.xclbin'
[ 3829.430157] [drm] The XCLBIN already loaded. Don't need to reload.
[ 3829.435742] [drm] Reconfiguration not supported
[###################################] 100 %
FPGA Time: 22.5645 s
FPGA Throughput: 46.2735 MB/s
Processed 7.91 MB in 22.761s (45.87 MBps)
[ 3852.037993] [drm] zocl_free_userptr_bo: obj 0x0000000099078c8b
root@zcu106vcu_base:~#
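The run time shows that processing dropped from 49 seconds to 22.5 seconds, cutting the total time by more than half.
The CU=4 setting does not seem to have taken effect; the cause needs investigating (perhaps the encoding is too slow?).
A53 CPU processing speed: 132/22.56 = 5.85 FPS
The code for this project was added by hand; it is not part of the original example.
Create a new Application Project in Vitis.
Platform: conv_system
APP: conv_gray
We implement grayscale in RTL: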
void grayscale_compute_dataflow(hls::stream<GrayPixel>& write_stream,
                                hls::stream<RGBPixel>& read_stream,
                                int elements) {
    RGBPixel pix_rgb;
    GrayPixel pix_gray;
    // Luma weights for the red, green and blue channels (they sum to 1.0)
    fixed cr(0.30);
    fixed cg(0.59);
    fixed cb(0.11);
    // Convert one pixel per iteration until all elements are consumed
    while (elements--) {
        read_stream >> pix_rgb;
        pix_gray = (pix_rgb.r * cr) + // red
                   (pix_rgb.g * cg) + // green
                   (pix_rgb.b * cb);  // blue
        write_stream << pix_gray;
    }
}
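When bringing up a kernel like this, it helps to have a plain software model to compare its output against. The following is a minimal sketch that uses the same 0.30/0.59/0.11 weights in integer arithmetic. `Rgb8`, `grayscale_reference` and `within_tolerance` are hypothetical names, and the rounding of the real fixed-point kernel may differ by a least-significant bit, which is why a tolerance is used.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for the kernel's RGBPixel type (8 bits per channel).
struct Rgb8 {
    std::uint8_t r, g, b;
};

// Software reference: weights 30/59/11 percent, truncating integer division.
// The weights sum to 100, so the result always fits in 8 bits.
std::uint8_t grayscale_reference(Rgb8 p) {
    const unsigned sum = 30u * p.r + 59u * p.g + 11u * p.b;
    return static_cast<std::uint8_t>(sum / 100u);
}

// True when a hardware result is within `tol` grey levels of the reference.
bool within_tolerance(std::uint8_t hw, std::uint8_t ref, int tol) {
    assert(tol >= 0);
    return std::abs(static_cast<int>(hw) - static_cast<int>(ref)) <= tol;
}
```

A test bench could run every pixel of a frame through `grayscale_reference` and check the kernel output with `within_tolerance(hw, ref, 1)`.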
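Note: the parts in bold italics are what differs from the previous test.
Because the number of AXI slaves is limited to at most 16, the CU count of both the Conv and Gray modules is set to 1.
Copy the firmware to the SD card, then run the following commands to test.
A53 CPU test run:
Grayscale computed in software: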
root@zcu106vcu_base:~#
root@zcu106vcu_base:~# /mnt/conv_gray.exe /mnt/conv_gray.xclbin --gray true /mnt/video.mp4 -o ./video_gray_g.mp4
input: /mnt/video.mp4
output: ./video_gray_g.mp4
video size: 1920x1080
nframes: 132
IN COMMAND: ffmpeg -v error -hide_banner -i /mnt/video.mp4 -f image2pipe -vcodec rawvideo -vf scale=w=1920:h=1080 -vframes 132 -pi-
OUT COMMAND: ffmpeg -v error -hide_banner -y -f rawvideo -vcodec rawvideo -pix_fmt gray -s 1920x1080 -framerate 25 -i - -f mp4 -q:v4
Binary Path: /mnt/conv_gray.xclbin
Processing 132 frames of /mnt/video.mp4 ...
[ 81.273251] [drm] Pid 2671 opened device
[ 81.277212] [drm] Pid 2671 closed device
[ 81.281354] [drm] Pid 2671 opened device
platform Name: Xilinx
Vendor Name : Xilinx
Found Platform
INFO: Importing /mnt/conv_gray.xclbin
Loading: '/mnt/conv_gray.xclbin'
[ 81.383444] [drm] The XCLBIN already loaded. Don't need to reload.
[ 81.386711] [drm] Reconfiguration not supported
compute_units = 1 1
lines_per_compute_unit = 1080
gray = 1 0
[## ] 6 %[ 82.924773] print_req_error: I/O error, dev mmcblk0, sector 7689
[ 82.930848] Buffer I/O error on dev mmcblk0p1, logical block 7554, lost async page write
[###################################] 100 %
FPGA Time: 23.4527 s
FPGA Throughput: 44.5211 MB/s
Processed 7.91 MB in 23.664s (44.12 MBps)
[ 104.869503] [drm] zocl_free_userptr_bo: obj 0x00000000c51e768b
[ 104.883850] [drm] Pid 2671 closed device
root@zcu106vcu_base:~#
Grayscale computed with hardware acceleration:
root@zcu106vcu_base:~# /mnt/conv_gray.exe /mnt/conv_gray.xclbin --gray true --gray_acc true /mnt/video.mp4 -o ./video_gray_g.mp4
input: /mnt/video.mp4
output: ./video_gray_g.mp4
video size: 1920x1080
nframes: 132
IN COMMAND: ffmpeg -v error -hide_banner -i /mnt/video.mp4 -f image2pipe -vcodec rawvideo -vf scale=w=1920:h=1080 -vframes 132 -pi-
OUT COMMAND: ffmpeg -v error -hide_banner -y -f rawvideo -vcodec rawvideo -pix_fmt gray -s 1920x1080 -framerate 25 -i - -f mp4 -q:v4
Binary Path: /mnt/conv_gray.xclbin
Processing 132 frames of /mnt/video.mp4 ...
[ 169.665196] [drm] Pid 2959 opened device
[ 169.669156] [drm] Pid 2959 closed device
[ 169.673512] [drm] Pid 2959 opened device
platform Name: Xilinx
Vendor Name : Xilinx
Found Platform
INFO: Importing /mnt/conv_gray.xclbin
Loading: '/mnt/conv_gray.xclbin'
[ 169.773466] [drm] The XCLBIN already loaded. Don't need to reload.
[ 169.777814] [drm] Reconfiguration not supported
compute_units = 1 1
lines_per_compute_unit = 1080
gray = 1 1
[###################################] 100 %
FPGA Time: 21.2221 s
FPGA Throughput: 49.2006 MB/s
Processed 7.91 MB in 21.441s (48.70 MBps)
[ 191.037663] [drm] zocl_free_userptr_bo: obj 0x00000000940c51e8
[ 191.052737] [drm] Pid 2959 closed device
root@zcu106vcu_base:~#
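Because the CU count was changed to 1, the speed is not comparable with the previous experiment. The experiment was therefore run twice, the only difference being whether the --gray_acc option was enabled.
With the option enabled the processing time is 21.22 seconds; without it, 23.45 seconds, so the compute time drops by about 9.5%.
A53 CPU processing speed: 132/21.22 = 6.22 FPS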
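The code for this project was added by hand; it is not part of the original example.
Create a new Application Project in Vitis.
Platform: conv_system
APP: conv_codec
Note: the parts in bold italics are what differs from the CPU-only test.
Copy the firmware to the SD card, then run the following commands to test.
A53 CPU test run: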
root@zcu106vcu_base:~# ./conv.exe ./conv.xclbin --cpu true ./video.mp4 -o ./video_color_cpu.mp4
input: ./video.mp4
output: ./video_color_cpu.mp4
video size: 1920x1080
nframes: 132
Accel:OFF
VCU decoder:OFF encoder:OFF
IN COMMAND: ffmpeg -v error -hide_banner -i ./video.mp4 -f image2pipe -vcodec rawvideo -vf scale=w=1920:h=1080 -vframes 132 -pix_fmt rgba -
OUT COMMAND: ffmpeg -v error -hide_banner -y -f rawvideo -vcodec rawvideo -pix_fmt rgba -s 1920x1080 -framerate 25 -i - -f mp4 -q:v 5 -an -codec mpeg4 ./video_color_cpu.mp4
Binary Path: ./conv.xclbin
Processing 132 frames of ./video.mp4 ...
[###################################] 100 %
Processed 7.91 MB in 54.541s (19.14 MBps)
root@zcu106vcu_base:~# mkdir video_color_cpu
root@zcu106vcu_base:~# cp xclbin.run_summary video_color_cpu && cp timeline_trace.csv video_color_cpu && cp profile_summary.csv video_color_cpu && cp video_color_cpu.mp4 video_color_cpu
root@zcu106vcu_base:~#
root@zcu106vcu_base:~# ./conv.exe ./conv.xclbin --cpu true --gray true ./video.mp4 -o ./video_gray_cpu.mp4
input: ./video.mp4
output: ./video_gray_cpu.mp4
video size: 1920x1080
nframes: 132
Accel:OFF
VCU decoder:OFF encoder:OFF
IN COMMAND: ffmpeg -v error -hide_banner -i ./video.mp4 -f image2pipe -vcodec rawvideo -vf scale=w=1920:h=1080 -vframes 132 -pix_fmt rgba -
OUT COMMAND: ffmpeg -v error -hide_banner -y -f rawvideo -vcodec rawvideo -pix_fmt gray -s 1920x1080 -framerate 25 -i - -f mp4 -q:v 5 -an -codec mpeg4 ./video_gray_cpu.mp4
Binary Path: ./conv.xclbin
Processing 132 frames of ./video.mp4 ...
[###################################] 100 %
Processed 7.91 MB in 58.208s (17.94 MBps)
root@zcu106vcu_base:~# mkdir video_gray_cpu
root@zcu106vcu_base:~# cp xclbin.run_summary video_gray_cpu && cp timeline_trace.csv video_gray_cpu && cp profile_summary.csv video_gray_cpu && cp video_gray_cpu.mp4 video_gray_cpu
root@zcu106vcu_base:~#
root@zcu106vcu_base:~# ./conv.exe ./conv.xclbin ./video.mp4 -o ./video_color_vcu_none.mp4
input: ./video.mp4
output: ./video_color_vcu_none.mp4
video size: 1920x1080
nframes: 132
Accel:ON
VCU decoder:OFF encoder:OFF
IN COMMAND: ffmpeg -v error -hide_banner -i ./video.mp4 -f image2pipe -vcodec rawvideo -vf scale=w=1920:h=1080 -vframes 132 -pix_fmt rgba -
OUT COMMAND: ffmpeg -v error -hide_banner -y -f rawvideo -vcodec rawvideo -pix_fmt rgba -s 1920x1080 -framerate 25 -i - -f mp4 -q:v 5 -an -codec mpeg4 ./video_color_vcu_none.mp4
Binary Path: ./conv.xclbin
Processing 132 frames of ./video.mp4 ...
Found Platform Number: 1
platform Name: Xilinx
Vendor Name : Xilinx
Found Platform
devices number : 1
INFO: Importing ./conv.xclbin
Loading: './conv.xclbin'
compute_units = 1 1
lines_per_compute_unit = 1080
gray = 0 0
[###################################] 100 %
FPGA Time: 45.9614 s
FPGA Throughput: 22.7178 MB/s
Processed 7.91 MB in 46.137s (22.63 MBps)
root@zcu106vcu_base:~# mkdir video_color_vcu_none
root@zcu106vcu_base:~# cp xclbin.run_summary video_color_vcu_none && cp timeline_trace.csv video_color_vcu_none && cp profile_summary.csv video_color_vcu_none && cp video_color_vcu_none.mp4 video_color_vcu_none
root@zcu106vcu_base:
root@zcu106vcu_base:~# ./conv.exe ./conv.xclbin --gray true --gray_acc true ./video.mp4 -o ./video_gray_vcu_none.mp4
input: ./video.mp4
output: ./video_gray_vcu_none.mp4
video size: 1920x1080
nframes: 132
Accel:ON
VCU decoder:OFF encoder:OFF
IN COMMAND: ffmpeg -v error -hide_banner -i ./video.mp4 -f image2pipe -vcodec rawvideo -vf scale=w=1920:h=1080 -vframes 132 -pix_fmt rgba -
OUT COMMAND: ffmpeg -v error -hide_banner -y -f rawvideo -vcodec rawvideo -pix_fmt gray -s 1920x1080 -framerate 25 -i - -f mp4 -q:v 5 -an -codec mpeg4 ./video_gray_vcu_none.mp4
Binary Path: ./conv.xclbin
Processing 132 frames of ./video.mp4 ...
Found Platform Number: 1
platform Name: Xilinx
Vendor Name : Xilinx
Found Platform
devices number : 1
INFO: Importing ./conv.xclbin
Loading: './conv.xclbin'
compute_units = 1 1
lines_per_compute_unit = 1080
gray = 1 1
[###################################] 100 %
FPGA Time: 21.1492 s
FPGA Throughput: 49.3702 MB/s
Processed 7.91 MB in 21.328s (48.96 MBps)
root@zcu106vcu_base:~# mkdir video_gray_vcu_none
root@zcu106vcu_base:~# cp xclbin.run_summary video_gray_vcu_none && cp timeline_trace.csv video_gray_vcu_none && cp profile_summary.csv video_gray_vcu_none && cp video_gray_vcu_none.mp4 video_gray_vcu_none
root@zcu106vcu_base:~#
root@zcu106vcu_base:~#
root@zcu106vcu_base:~# ./conv.exe ./conv.xclbin --enc true --dec true ./video.mp4 -o ./video_color_vcu_all.mp4
input: ./video.mp4
output: ./video_color_vcu_all.mp4
video size: 1920x1080
nframes: 132
Accel:ON
VCU decoder:ON encoder:ON
IN COMMAND: filesrc location=./video.mp4 ! queue ! qtdemux ! queue ! h264parse ! video/x-h264, alignment=au ! queue ! omxh264dec ! video/x-raw,format=NV12,width=1920,height=1080 ! queue ! appsink
OUT COMMAND: appsrc ! queue ! videoconvert ! video/x-raw,width=1920,height=1080 ! queue ! omxh264enc target-bitrate=2000 ! video/x-h264, alignment=au ! queue ! capsfilter ! h264parse ! queue ! qtmux ! queue ! filesink location=./video_color_vcu_all.mp4
Binary Path: ./conv.xclbin
Processing 132 frames of ./video.mp4 ...
Found Platform Number: 1
platform Name: Xilinx
Vendor Name : Xilinx
Found Platform
devices number : 1
INFO: Importing ./conv.xclbin
Loading: './conv.xclbin'
compute_units = 1 1
lines_per_compute_unit = 1080
gray = 0 0
inFrameMatNV12 1920 1620 1 0
outFrameMatRGBA 1920 1080 4 24
[ ] 1 %!! Warning : Adapting profile to support bitdepth and chroma mode
!! The specified Level is too low and will be adjusted !!
[################################## ] 99 %
Error: partial frame 131 read failed
FPGA Time: 9.29383 s
FPGA Throughput: 112.348 MB/s
Processed 7.91 MB in 9.467s (110.29 MBps)
root@zcu106vcu_base:~# mkdir video_color_vcu_all
root@zcu106vcu_base:~# cp xclbin.run_summary video_color_vcu_all && cp timeline_trace.csv video_color_vcu_all && cp profile_summary.csv video_color_vcu_all && cp video_color_vcu_all.mp4 video_color_vcu_all
root@zcu106vcu_base:~#
root@zcu106vcu_base:~# ./conv.exe ./conv.xclbin --gray true --gray_acc true --enc true --dec true ./video.mp4 -o ./video_gray_vcu_all.mp4
input: ./video.mp4
output: ./video_gray_vcu_all.mp4
video size: 1920x1080
nframes: 132
Accel:ON
VCU decoder:ON encoder:ON
IN COMMAND: filesrc location=./video.mp4 ! queue ! qtdemux ! queue ! h264parse ! video/x-h264, alignment=au ! queue ! omxh264dec ! video/x-raw,format=NV12,width=1920,height=1080 ! queue ! appsink
OUT COMMAND: appsrc ! queue ! videoconvert ! video/x-raw,width=1920,height=1080 ! queue ! omxh264enc target-bitrate=2000 ! video/x-h264, alignment=au ! queue ! capsfilter ! h264parse ! queue ! qtmux ! queue ! filesink location=./video_gray_vcu_all.mp4
Binary Path: ./conv.xclbin
Processing 132 frames of ./video.mp4 ...
Found Platform Number: 1
platform Name: Xilinx
Vendor Name : Xilinx
Found Platform
devices number : 1
INFO: Importing ./conv.xclbin
Loading: './conv.xclbin'
compute_units = 1 1
lines_per_compute_unit = 1080
gray = 1 1
inFrameMatNV12 1920 1620 1 0
outFrameMatGRAY 1920 1080 1 0
[ ] 1 %!! Warning : Adapting profile to support bitdepth and chroma mode
!! The specified Level is too low and will be adjusted !!
[################################## ] 99 %
Error: partial frame 131 read failed
FPGA Time: 10.6679 s
FPGA Throughput: 97.8773 MB/s
Processed 7.91 MB in 10.841s (96.31 MBps)
root@zcu106vcu_base:~# mkdir video_gray_vcu_all
root@zcu106vcu_base:~# cp xclbin.run_summary video_gray_vcu_all && cp timeline_trace.csv video_gray_vcu_all && cp profile_summary.csv video_gray_vcu_all && cp video_gray_vcu_all.mp4 video_gray_vcu_all
root@zcu106vcu_base:~#
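With GStreamer alone, neither the gray nor the rgba format could be matched to the input format of the H.264 encoder, so a color conversion step (using OpenCV) was added in between; the speed is therefore not comparable with the previous experiment. For an unknown reason only 131 frames were processed, one fewer than expected.
Processing time: 10.67 seconds
A53 CPU processing speed: 131/10.67 = 12.28 FPS (based on the 131 frames actually processed)
This is still far from the 30 FPS target, so further end-to-end optimization is needed.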
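The log line `inFrameMatNV12 1920 1620 1 0` follows from the NV12 layout: a full-resolution Y plane followed by a half-height plane of interleaved U/V samples, stored as one single-channel 8-bit image. The sketch below works out those dimensions and the per-frame buffer sizes involved in the conversion. The helper names are hypothetical, and reading the four log fields as columns, rows, channels and OpenCV type code is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Dimensions of a single-channel 8-bit matrix (hypothetical helper type).
struct PlaneDims {
    int cols;
    int rows;
};

// NV12 stores `height` rows of Y followed by `height / 2` rows of interleaved
// UV, so a frame is held as a width x (height * 3 / 2) single-channel matrix.
PlaneDims nv12_mat_dims(int width, int height) {
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    return {width, height + height / 2};
}

// Bytes per NV12 frame: 12 bits per pixel.
std::size_t nv12_frame_bytes(int width, int height) {
    const PlaneDims d = nv12_mat_dims(width, height);
    return static_cast<std::size_t>(d.cols) * static_cast<std::size_t>(d.rows);
}

// Bytes per RGBA frame: 4 bytes per pixel.
std::size_t rgba_frame_bytes(int width, int height) {
    assert(width > 0 && height > 0);
    return static_cast<std::size_t>(width) * static_cast<std::size_t>(height) * 4u;
}
```

For 1920x1080 this gives a 1920x1620 matrix, matching the log, and it shows that each RGBA frame carries well over twice the bytes of the NV12 frame it was converted from.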
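The performance analysis tool Vitis Analyzer was used during testing.
Add an xrt.ini file to the directory the program is run from,
which turns profiling on by default: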
[Debug]
profile=true
data_transfer_trace=fine
stall_trace=all
timeline_trace=true
root@zcu106vcu_base:~#
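We go straight to analyzing the final run, the one that took 10.67 seconds.
We can see that:
The run time of convolve_fpga is rather unstable: 34 ms on average, 131 ms at most, and 4571 ms in total. Five runs took longer than 100 ms.
grayscale_fpga behaves similarly: 16 ms on average, 40 ms at most, and 2098 ms in total.
The last figure shows that although convolve_fpga and grayscale_fpga can be queued for execution, they are not pipelined with the codec.
Memory reads take 317 ms and writes take 1105 ms.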
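One way to read these totals is to ask what share of the 10.67-second run the two kernels account for. The sketch below adds up the per-kernel totals and divides by the wall-clock time. It assumes the kernel executions do not overlap each other, which the timeline may not fully bear out, and `busy_fraction` is a hypothetical helper name.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Fraction of wall-clock time covered by the summed kernel totals.
// Assumes the kernels run back to back; overlapping executions would make
// the true busy fraction smaller than the value returned here.
double busy_fraction(const std::vector<double>& kernel_totals_ms, double wall_ms) {
    assert(wall_ms > 0.0);
    const double busy_ms =
        std::accumulate(kernel_totals_ms.begin(), kernel_totals_ms.end(), 0.0);
    return busy_ms / wall_ms;
}
```

With the totals above (4571 ms and 2098 ms) against the reported FPGA time of 10667.9 ms, the two kernels cover roughly five eighths of the run under this assumption, so the remaining time has to come from the codec, the color conversion and data movement.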
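Optimization considerations:
Using Vitis and a custom ZCU106 XRT platform, the Convolution Example from Vitis-Tutorials was tested end to end, and the VCU was used to speed up decoding and encoding.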