MTCNN on FPGA (Part 1): A First Pass at the SDK-Side Program

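Background: the IP core has been integrated into the system and can be invoked successfully. We now need to write an SDK program on the processor (PS) side that calls the IP core for testing.

Goal: write an SDK program that calls the IP core from the processor side. Without guaranteeing correct results yet, get a rough measurement of the frame rate.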

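Contents

1. Calling the IP core once

1.1 Allocating memory with malloc

1.2 Notes on DDR access

1.3 Calling the IP core

1.4 The pointer offset problem

1.5 Remaining problems

2. Convolution on the PS

2.1 Timing functions in the Zynq SDK

2.2 Convolution time with malloc-allocated buffers on the PS

2.3 Convolution on the PS using DDR

2.4 PS-side program and printed output

3. Frame-rate test on the PS

3.1 Convolution time of selected layers

3.2 Wrapping the convolution layer

3.3 Simulating the network on the PS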

1. Calling the IP core once

The address information that the system generates for the processor side is stored in system.hdf.


The idea of the program is to have the processor write the data for the IP core, then start the IP core to perform the convolution.

XConvolution_3x3 XConvolution_3x3_Core;

int main()
{
    init_platform();
	printf("\n --------------program start------------- \n");

	//initialize IPcore
	XConvolution_3x3_Initialize(&XConvolution_3x3_Core,XPAR_CONVOLUTION_3X3_0_DEVICE_ID);
	printf("Initialize XConvolution_3x3_Core IPcore SUCCESS!\n");

	//conv parameters
	int inputSize=5; int inChannelNum=3;
	int outputSize=3;  int OutChannelNum=3;
	int kernelSize=3; int Stride=1;
	int Input_Pixels=inputSize*inputSize*inChannelNum;
	int Output_Pixels=outputSize*outputSize*OutChannelNum;
	int weightkernel_Pixels=9*inChannelNum*OutChannelNum;

	//conv variable
	struct Weight weightIn;
	struct pBox featureIn;
	struct pBox conv_PL_out;
	struct pBox conv_PS_out;

	//initialize conv weight variable
	weightIn.out_ChannelNum=OutChannelNum;
	weightIn.in_ChannelNum=inChannelNum;
	weightIn.kernelSize=kernelSize;
	weightIn.stride=Stride;
	//init convolution ptr to DRAM
	//weightIn.pdata=(volatile float *)0x10000000;
	weightIn.pdata=(volatile float *)malloc(sizeof(float)*weightkernel_Pixels);
	print("Weight parameter init SUCCESS!\n\r");
	//init convolution ptr to DRAM
	for (int i=0;i

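The IP core call has succeeded when the IsDone function returns 1, so the program prints the value returned by IsDone.

1.1 Allocating memory with malloc

With the buffers allocated by malloc, the PS-side convolution runs correctly, but the IP core does not.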

 --------------program start-------------
Initialize XConvolution_3x3_Core IPcore SUCCESS!
Weight parameter init SUCCESS!
Weight in DRAM init SUCCESS!
Feature in DRAM init SUCCESS!
Output variable init SUCCESS!
Input_Pixels is 75 and hex memory size is 0000012c
weight_pixels is 81 and hex memory size is 00000144
Output_Pixels is 27 and hex memory size is 0000006c
Input pointer value is 00118898
Weight pointer value is 00118750
Output PS pointer value is 001189c8
Output PL pointer value is 00118a38
Conv in PS SUCCESS!
IP core return is 0
IP core isDone is 0
Set IP core conv parameters SUCCESS!
Set IP core start SUCCESS!
IP core return is 0
IP core isDone is 1
IP core not Done!
IP core return is 0
IP core isDone is 0
Conv in PL done SUCCESS!
Convolution ERROR!
i is 0, value in PS= 1.521900, in PL is 0.000000
Convolution ERROR!
i is 1, value in PS= 1.557000, in PL is 0.000000

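This suggests that malloc cannot provide enough memory and that the IP core cannot operate on pointers returned by malloc. The IP core may also hang and stall the program, in which case IsDone never reports completion.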

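1.2 Notes on DDR access

According to the following reference: https://blog.csdn.net/yqq654101/article/details/80373971

When the PL accesses DDR as a master, the address range it uses is 0x40000000-0x7FFFFFFF.

So when building the design, the addresses assigned to the PL must fall within 0x40000000-0x7FFFFFFF for it to reach the PS DDR. (This claim should be checked against the board's actual address map: in the experiments in section 1.4, an offset of 0x40000000 fails and hangs, while offsets below it succeed.)

Another point worth noting is D-cache coherency. So that the PL can read the data in DDR directly, the SDK program initializes with init_platform() from the memory test template, which turns the D-cache off. Only then can the PL read the data from DDR directly; otherwise the PS works on data held in the cache.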

1.3 Calling the IP core

 --------------program start-------------
init network parameters run time is 0.000474 mili second
Output variable init SUCCESS!
Write conv data to DRAM run time is 2.690571 mili second
Initialize XConvolution_3x3_Core IPcore SUCCESS!
---------print IP core value---------
IP core return is 0
IP core isDone is 0
IP core get inHight is 5
IP core get weight prt is 1000000
Set conv parameters SUCCESS!IPcore conv start SUCCESS!
---------print IP core value---------
IP core return is 0
IP core isDone is 1
IP core get inHight is 5
IP core get weight prt is 1000000
Strat again SUCCESS!
IP core Done SUCCESS!0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 IP core not Done!FAILURE!
Conv run time is 2046.292483 mili second
Input_Pixels is 50 and hex memory size is 000000c8
weight_pixels is 36 and hex memory size is 00000090
Output_Pixels is 18 and hex memory size is 00000048
Input pointer value is 01000090
Weight pointer value is 01000000
Output PS pointer value is 010000d8
InputSize is    5, In_channels   is   2, Input_Pixels  is  50
OutputSize is   3, Out_channels  is   2, Output_Pixels is  18
Stride   is     1, weight_pixels is  36
------------Program End SUCCESS!-----------

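We find that each start produces exactly one IsDone: once IsDone has been read, it is reset to 0. However, possibly because of a write problem, the IP core did not write its results to the expected location in DDR.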

With a small convolution, the core finishes almost immediately:

 --------------program start-------------
init network parameters run time is 0.000471 mili second
Output variable init SUCCESS!
Write conv data to DRAM run time is 2.690556 mili second
Initialize XConvolution_3x3_Core IPcore SUCCESS!
---------print IP core value---------
IP core return is 0
IP core isDone is 0
---------print IP core value---------
IP core return is 0
IP core isDone is 0
IP core get inHight is 5
IP core get weight prt is 1000000
Set conv parameters SUCCESS!
IPcore conv start SUCCESS!
---------print IP core value---------
IP core return is 0
IP core isDone is 1
---------print IP core value---------
IP core return is 0
IP core isDone is 0
IP core get inHight is 8
IP core get weight prt is 1000000
Strat SUCCESS!
IP core Done SUCCESS!Conv run time is 50.519705 mili second
Input_Pixels is 192 and hex memory size is 00000300
weight_pixels is 54 and hex memory size is 000000d8
Output_Pixels is 72 and hex memory size is 00000120
Input pointer value is 010000d8
Weight pointer value is 01000000
Output PS pointer value is 010001f8
InputSize is    8, In_channels   is   3, Input_Pixels  is 192
OutputSize is   6, Out_channels  is   2, Output_Pixels is  72
Stride   is     1, weight_pixels is  54
------------Program End SUCCESS!-----------

Once the convolution size is set large, the next run hangs, possibly because of a pointer problem.

 --------------program start-------------
init network parameters run time is 0.000471 mili second
Output variable init SUCCESS!
Write conv data to DRAM run time is 2.690643 mili second
Initialize XConvolution_3x3_Core IPcore SUCCESS!
---------print IP core value---------
IP core return is 0
IP core isDone is 0
---------print IP core value---------
IP core return is 0
IP core isDone is 0
IP core get inHight is 0
IP core get weight prt is 0
Set conv parameters SUCCESS!
IPcore conv start SUCCESS!
---------print IP core value---------
IP core return is 0
IP core isDone is 0
---------print IP core value---------
IP core return is 0
IP core isDone is 0
IP core get inHight is 23
IP core get weight prt is 1000000
Strat SUCCESS!
IPcore Not Done! current sleep times is 0
IPcore Not Done! current sleep times is 1
IPcore Not Done! current sleep times is 2
IPcore Not Done! current sleep times is 3
IPcore Not Done! current sleep times is 4
IPcore Not Done! current sleep times is 5
IPcore Not Done! current sleep times is 6
IP core Done SUCCESS!Conv run time is 77.819314 mili second
Input_Pixels is 16928 and hex memory size is 00010880
weight_pixels is 18432 and hex memory size is 00012000
Output_Pixels is 28224 and hex memory size is 0001b900
Input pointer value is 01012000
Weight pointer value is 01000000
Output PS pointer value is 0102d900
InputSize is   23, In_channels   is  32, Input_Pixels  is 16928
OutputSize is  21, Out_channels  is  64, Output_Pixels is 28224
Stride   is     1, weight_pixels is 18432
------------Program End SUCCESS!-----------

 --------------program start-------------

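1.4 The pointer offset problem

The pointer offset (the base address of the buffers) affects whether the call produces a result. We first test with a small convolution: input size 8, output size 6, with 3 input channels and 2 output channels.

0x00000000: failed

0x00100000: failed

0x00101111: failed

0x01000000: succeeded

0x02000000: succeeded

0x10000000: succeeded

0x20000000: succeeded

0x30000000: succeeded

0x40000000: failed, and the system hung

It is not clear why the lowest offsets fail, while higher offsets work. One possible explanation, not verified here: the malloc pointers printed in section 1.1 (around 0x00118000) show that the program's own heap lives in low DDR, so buffers placed near 0x00100000 may overwrite the running program.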

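1.5 Remaining problems

  • The pointer seen by the IP core differs from the pointer on the processor side, and it is not known where the convolution result is written.
  • When the convolution is large, the IP core can be called only once; calling it again hangs the processor.

2. Convolution on the PS

2.1 Timing functions in the Zynq SDK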

//headers to add
#include "xtime_l.h"
#include "sleep.h"

//start of the timed section
	XTime timeEnd, timeStart;
	double timeUsed;
	XTime_GetTime(&timeStart);
//code being timed
	int sec=10;
	int mili_sec=sec*1000;
	usleep(1000*mili_sec);
//end of the timed section
	XTime_GetTime(&timeEnd);
	timeUsed = (timeEnd-timeStart)/(double)COUNTS_PER_SECOND;
	
	printf("time end is   %llu \n",timeEnd);
	printf("time start is %llu \n",timeStart);
	printf("timeEnd-timeStart is %llu \n",timeEnd-timeStart);
	printf("COUNTS_PER_SECOND is %d \n",COUNTS_PER_SECOND);
	printf("usleep time is %lf mili second\n",1000*timeUsed);

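Note that the timer value is 64 bits wide (XTime is an unsigned 64-bit integer), so the following form, which casts each timestamp to float before subtracting, loses precision and can display the wrong time.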

timeUsed = (((float)timeEnd-(float)timeStart))/((float)COUNTS_PER_SECOND);

usleep takes its argument in microseconds, i.e. units of 10^-6 s.

2.2 Convolution time with malloc-allocated buffers on the PS

With the buffers allocated by malloc, the convolution times are:

InputSize is    5, In_channels   is   3, Input_Pixels  is  75
OutputSize is   3, Out_channels  is   3, Output_Pixels is  27
Stride   is     1, weight_pixels is  81
Total run time is 33.950603 mili second

InputSize is   10, In_channels   is   3, Input_Pixels  is 300
OutputSize is   8, Out_channels  is   3, Output_Pixels is 192
Stride   is     1, weight_pixels is  81
Total run time is 34.128139 mili second

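These runs do not place the buffers at fixed DRAM addresses. When the convolution is too large, malloc can no longer be used, because the PS does not have enough RAM available for it.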

Input_Pixels is 33856 and hex memory size is 00021100
weight_pixels is 36864 and hex memory size is 00024000
Output_Pixels is 28224 and hex memory size is 0001b900
Input pointer value is 00000000
Weight pointer value is 00000000
Output PS pointer value is 00000000
The feature is NULL!
InputSize is   23, In_channels   is  64, Input_Pixels  is 33856
OutputSize is  21, Out_channels  is  64, Output_Pixels is 28224
Stride   is     1, weight_pixels is 36864
Total run time is 36.471764 mili second


Input_Pixels is 1875 and hex memory size is 00001d4c
weight_pixels is 81 and hex memory size is 00000144
Output_Pixels is 1587 and hex memory size is 000018cc
Input pointer value is 00000000
Weight pointer value is 00114740
Output PS pointer value is 00000000
The feature is NULL!
InputSize is   25, In_channels   is   3, Input_Pixels  is 1875
OutputSize is  23, Out_channels  is   3, Output_Pixels is 1587
Stride   is     1, weight_pixels is  81

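2.3 Convolution on the PS using DDR

Previously the buffers were allocated with malloc; now we place them in DDR directly.

However, using that address directly as the pointer value does not let the PS run normally. Setting the first offset to 0x01000000 mostly works. Could the DDR addresses be conflicting with something? To keep the program from failing, we temporarily set the address to: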

int ps7_ddr_0_loc=0x01000000;

	//init convolution ptr to DRAM
	weightIn.pdata=(volatile float *)ps7_ddr_0_loc;
	//weightIn.pdata=(volatile float *)malloc(sizeof(float)*weightkernel_Pixels);
	featureIn.pdata=(volatile float *)((unsigned int)weightIn.pdata+weightkernel_Pixels*sizeof(float));
	//featureIn.pdata=(volatile float*)malloc(sizeof(float)*Input_Pixels);
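	//the output buffer starts after the Input_Pixels floats of the input; offsetting by Output_Pixels makes the buffers overlap when the output is smaller than the input
	conv_PS_out.pdata=(volatile float *)((unsigned int)featureIn.pdata+sizeof(float)*Input_Pixels);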
	//conv_PS_out.pdata=(volatile float*)malloc(sizeof(float)*Output_Pixels);
	

2.4 PS-side program and printed output

PS-side program:

	printf("\n --------------program start------------- \n");
	
//----------------------------init network parameters---------------
	XTime timeEnd, timeStart;
	float timeUsed;
	XTime_GetTime(&timeStart);

	//conv parameters
	int inputSize=26; int inChannelNum=32;
	int outputSize=24;  int OutChannelNum=64;
	int kernelSize=3; int Stride=1;
	int Input_Pixels=inputSize*inputSize*inChannelNum;
	int Output_Pixels=outputSize*outputSize*OutChannelNum;
	int weightkernel_Pixels=9*inChannelNum*OutChannelNum;

	//conv variable
	struct Weight weightIn;
	struct pBox featureIn;
	struct pBox conv_PS_out;

	//initialize conv weight variable
	weightIn.out_ChannelNum=OutChannelNum;
	weightIn.in_ChannelNum=inChannelNum;
	weightIn.kernelSize=kernelSize;
	weightIn.stride=Stride;
	
	//initialize conv Input variable
	featureIn.width=inputSize;
	featureIn.height=inputSize;
	featureIn.channel=inChannelNum;
	
	//initialize conv Output variable
	conv_PS_out.width=outputSize;
	conv_PS_out.height=outputSize;
	conv_PS_out.channel=OutChannelNum;
	
	//init convolution ptr to DRAM
	weightIn.pdata=(volatile float *)ps7_ddr_0_loc;
	//weightIn.pdata=(volatile float *)malloc(sizeof(float)*weightkernel_Pixels);
	featureIn.pdata=(volatile float *)((unsigned int)weightIn.pdata+weightkernel_Pixels*sizeof(float));
	//featureIn.pdata=(volatile float*)malloc(sizeof(float)*Input_Pixels);
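	//the output buffer starts after the Input_Pixels floats of the input; offsetting by Output_Pixels makes the buffers overlap when the output is smaller than the input
	conv_PS_out.pdata=(volatile float *)((unsigned int)featureIn.pdata+sizeof(float)*Input_Pixels);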
	//conv_PS_out.pdata=(volatile float*)malloc(sizeof(float)*Output_Pixels);
	
	XTime_GetTime(&timeEnd);
	//subtract the 64-bit counters first, then convert, to avoid float rounding
	timeUsed = (timeEnd-timeStart)/(double)COUNTS_PER_SECOND;
	printf("init network parameters run time is %f mili second\n",1000*timeUsed);
	
//-----------------write conv data to DRAM---------------------------
	XTime_GetTime(&timeStart);
	
	//init weight data to DRAM
	for (int i=0;i

Printed output:

 --------------program start-------------
init network parameters run time is 0.000507 mili second
Output variable init SUCCESS!
Write conv data to DRAM run time is 2.690556 mili second
Conv times is 0
Conv times is 1
Conv times is 2
Conv in PS run time is 2338.129639 mili second
Input_Pixels is 21632 and hex memory size is 00015200
weight_pixels is 18432 and hex memory size is 00012000
Output_Pixels is 36864 and hex memory size is 00024000
Input pointer value is 01012000
Weight pointer value is 01000000
Output PS pointer value is 01036000
InputSize is   26, In_channels   is  32, Input_Pixels  is 21632
OutputSize is  24, Out_channels  is  64, Output_Pixels is 36864
Stride   is     1, weight_pixels is 18432
------------Program End SUCCESS!-----------

3. Frame-rate test on the PS

3.1 Convolution time of selected layers

Final network structure: https://blog.csdn.net/weixin_36474809/article/details/84578946#%E4%BA%8C%E3%80%81%E9%87%87%E7%94%A8%E7%BD%91%E7%BB%9C%E7%BB%93%E6%9E%84%E8%A1%A8

Onet

 --------------program start-------------
init network parameters run time is 0.000507 mili second
Output variable init SUCCESS!
Write conv data to DRAM run time is 2.690517 mili second
Conv times is 0
Conv in PS run time is 780.324097 mili second
Input_Pixels is 21632 and hex memory size is 00015200
weight_pixels is 18432 and hex memory size is 00012000
Output_Pixels is 36864 and hex memory size is 00024000
Input pointer value is 01012000
Weight pointer value is 01000000
Output PS pointer value is 01036000
InputSize is   26, In_channels   is  32, Input_Pixels  is 21632
OutputSize is  24, Out_channels  is  64, Output_Pixels is 36864
Stride   is     1, weight_pixels is 18432
------------Program End SUCCESS!-----------

Rnet

init network parameters run time is 0.000501 mili second
Output variable init SUCCESS!
Write conv data to DRAM run time is 2.690679 mili second
Conv times is 0
Conv in PS run time is 129.388367 mili second
Input_Pixels is 5488 and hex memory size is 000055c0
weight_pixels is 12096 and hex memory size is 0000bd00
Output_Pixels is 6912 and hex memory size is 00006c00
Input pointer value is 0100bd00
Weight pointer value is 01000000
Output PS pointer value is 01012900
InputSize is   14, In_channels   is  28, Input_Pixels  is 5488
OutputSize is  12, Out_channels  is  48, Output_Pixels is 6912
Stride   is     1, weight_pixels is 12096

Assuming a 480×480×3 input image:

init network parameters run time is 0.000510 mili second
Output variable init SUCCESS!
Write conv data to DRAM run time is 41.652718 mili second
Conv times is 0
Conv in PS run time is 4555.694824 mili second
Input_Pixels is 691200 and hex memory size is 002a3000
weight_pixels is 270 and hex memory size is 00000438
Output_Pixels is 2284840 and hex memory size is 008b74a0
Input pointer value is 01000438
Weight pointer value is 01000000
Output PS pointer value is 018b78d8
InputSize is  480, In_channels   is   3, Input_Pixels  is 691200
OutputSize is 478, Out_channels  is  10, Output_Pixels is 2284840
Stride   is     1, weight_pixels is 270

With stride 2:

init network parameters run time is 0.000501 mili second
Output variable init SUCCESS!
Write conv data to DRAM run time is 2.690502 mili second
Conv times is 0
Conv in PS run time is 4.155555 mili second
Input_Pixels is 1210 and hex memory size is 000012e8
weight_pixels is 1440 and hex memory size is 00001680
Output_Pixels is 400 and hex memory size is 00000640
Input pointer value is 01001680
Weight pointer value is 01000000
Output PS pointer value is 01001cc0
InputSize is   11, In_channels   is  10, Input_Pixels  is 1210
OutputSize is   5, Out_channels  is  16, Output_Pixels is 400
Stride   is     2, weight_pixels is 1440

3.2 Wrapping the convolution layer

Wrap a layer's convolution in a single function, so that each layer is easy to set up.

void layer_conv(int inputSize, int inChannelNum,
				int outputSize,int OutChannelNum,
				int kernelSize,int Stride){
					
	//conv parameters
	int Input_Pixels=inputSize*inputSize*inChannelNum;
	int Output_Pixels=outputSize*outputSize*OutChannelNum;
	int weightkernel_Pixels=9*inChannelNum*OutChannelNum;
	//conv variable
	struct Weight weightIn;
	struct pBox featureIn;
	struct pBox conv_PS_out;
	//initialize conv weight variable
	weightIn.out_ChannelNum=OutChannelNum;
	weightIn.in_ChannelNum=inChannelNum;
	weightIn.kernelSize=kernelSize;
	weightIn.stride=Stride;
	//initialize conv Input variable
	featureIn.width=inputSize;
	featureIn.height=inputSize;
	featureIn.channel=inChannelNum;
	//initialize conv Output variable
	conv_PS_out.width=outputSize;
	conv_PS_out.height=outputSize;
	conv_PS_out.channel=OutChannelNum;
	//init convolution ptr to DRAM
	weightIn.pdata=(volatile float *)ps7_ddr_0_loc;
	//weightIn.pdata=(volatile float *)malloc(sizeof(float)*weightkernel_Pixels);
	featureIn.pdata=(volatile float *)((unsigned int)weightIn.pdata+weightkernel_Pixels*sizeof(float));
	//featureIn.pdata=(volatile float*)malloc(sizeof(float)*Input_Pixels);
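	//the output buffer starts after the Input_Pixels floats of the input; offsetting by Output_Pixels makes the buffers overlap when the output is smaller than the input
	conv_PS_out.pdata=(volatile float *)((unsigned int)featureIn.pdata+sizeof(float)*Input_Pixels);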
	//conv_PS_out.pdata=(volatile float*)malloc(sizeof(float)*Output_Pixels);
	
	//init weight data to DRAM
	for (int i=0;i

Usage:

	XTime timeEnd, timeStart;
	float timeUsed;
	XTime_GetTime(&timeStart);
	//         inSize   inChannel  outSize  outChannel kernel stride
	layer_conv(14,      28,        12,      48,        3,     1);

	XTime_GetTime(&timeEnd);
	//subtract the 64-bit counters first, then convert, to avoid float rounding
	timeUsed = (timeEnd-timeStart)/(double)COUNTS_PER_SECOND;
	printf("layer conv time is %f mili second\n",1000*timeUsed);

3.3 Simulating the network on the PS

	Pnet1:{
		printf("Pnet1\n");
		//         inSize   inChannel  outSize  outChannel kernel stride
		layer_conv(480,      3,        478,      10,        3,     1);
		layer_conv(478+1,    10,       239,      16,        3,     2);
		layer_conv(239,      16,       237,      32,        3,     1);
		layer_conv(237,      32,       235,      32,        3,     1);
	}
	
	Pnet2:{
		printf("Pnet2\n");
		//         inSize   inChannel  outSize  outChannel kernel stride
		layer_conv(240,      3,        238,      10,        3,     1);
		layer_conv(238+1,    10,       119,      16,        3,     2);	//(239-3)/2+1 = 119
		layer_conv(119,      16,       117,      32,        3,     1);
		layer_conv(117,      32,       115,      32,        3,     1);
	}
	
	Pnet3:{
		printf("Pnet3\n");
		//         inSize   inChannel  outSize  outChannel kernel stride
		layer_conv(120,      3,        118,      10,        3,     1);
		layer_conv(118+1,    10,       59,       16,        3,     2);
		layer_conv(59,       16,       57,       32,        3,     1);
		layer_conv(57,       32,       55,       32,        3,     1);
	}
		
	for(int Rnet_cur_times=0;Rnet_cur_times<10;Rnet_cur_times++){
		printf("Rnet run times %d\n",Rnet_cur_times);
		//         inSize    inChannel outSize  outChannel kernel stride
		layer_conv(24+2,       3,        24,      24,        3,     1);
		layer_conv(24+1,       28,       12,      12,        3,     2);
		layer_conv(12+2,       28,       12,      12,        3,     1);
		layer_conv(12+1,       48,       6,       6,         3,     2);		
		layer_conv(6+1 ,       48,       3,       3,         3,     2);
	}
	
	for(int Onet_cur_times=0;Onet_cur_times<5;Onet_cur_times++){
		printf("Onet run times %d\n",Onet_cur_times);
		//         inSize    inChannel outSize  outChannel kernel stride
		layer_conv(48+2,     3,        48,      32,        3,     1);
		layer_conv(48+1,     32,       24,      32,        3,     2);
		layer_conv(24+2,     32,       24,      64,        3,     1);
		layer_conv(24+1,     64,       12,      64,        3,     2);		
		layer_conv(12+1,     64,       6,       128,       3,     2);
		layer_conv(6+1,      128,      3,       128,       3,     2);
	}

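Running the convolutions on the PS takes an extremely long time:

One pass through the network parameters above takes nearly 100.94 seconds.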

 

 

 

 

 

 

 
